Date: 2026-01-05
Categories: Artificial Intelligence, Digital Scholarship, Duke researchers, Instruction, Library Hacks

It’s 2026. Why Are LLMs Still Hallucinating?

Way back in spring 2023, we wrote about the emergence of ChatGPT on Duke’s campus: the magical tool that could help “Write papers! Debug code! Create websites from thin air! Do your laundry!” By 2026, AI can do most of those things (well…maybe not your laundry). But one problem we highlighted back then persists today: LLMs still make stuff up.

When I talk to Duke students, many describe first-hand encounters with AI hallucinations – plausible-sounding but factually incorrect AI-generated info. A 2025 research study of Duke students found that 94% believe Generative AI’s accuracy varies significantly across subjects, and 90% want clearer transparency about an AI tool’s limitations. Yet despite these concerns, 80% still expect AI to personalize their own learning within the next five years. Students don’t want to throw the baby out with the bathwater when these tools can break down complex topics, summarize dense course readings, or turn a messy pile of class notes into a coherent study outline. This tension between AI’s usefulness and its unreliability raises an obvious question: if the newest “reasoning models” are smarter and more precise, why do hallucinations persist?

Below are four core reasons.

1. Benchmark tests for LLMs favor guessing over IDK

You’ve probably seen the headlines: the latest version of [insert AI chatbot here] has aced the MCAT, crushed the LSAT, and can perform PhD-level reasoning tasks. Impressive as this sounds, many of the benchmark evaluation tests for LLMs reward guessing over acknowledging uncertainty, as explained in OpenAI’s post, Why Language Models Hallucinate. This leads to the question: why can’t AI companies just design models that say “I don’t know”? The short answer is that today’s LLMs are trained to produce the most statistically likely answer, not to assess their own confidence. Without an evaluation system that rewards saying “I don’t know,” models will default to guessing. But even if we fix the benchmarks, another problem remains: the quality of the information LLMs train on is often pretty bad.
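To see why guessing wins, here is a minimal sketch of the scoring arithmetic. It assumes a simplified benchmark where a correct answer earns 1 point and “I don’t know” earns 0; the function names and the optional penalty for wrong answers are illustrative assumptions, not the rules of any real benchmark.

```python
def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Average points from guessing: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


IDK_SCORE = 0.0  # assumption: abstaining earns zero points either way


def best_strategy(p_correct: float, wrong_penalty: float = 0.0) -> str:
    """Pick whichever option has the higher expected score."""
    guess = expected_score(p_correct, wrong_penalty)
    return "guess" if guess > IDK_SCORE else "say IDK"


# No penalty for wrong answers: even a 10%-confident guess beats abstaining.
print(best_strategy(0.10))                     # guess
# Wrong answers cost 1 point: the same shaky guess is now a losing bet.
print(best_strategy(0.10, wrong_penalty=1.0))  # say IDK
```

Under the no-penalty scheme, any nonzero chance of being right makes guessing the better move, so a model tuned to top that kind of leaderboard has little reason to admit uncertainty.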

2. Training data for LLMs is riddled with inaccuracies, half-truths, and opinions

The principle of GIGO (Garbage In, Garbage Out) is critical to understanding the hallucination problem. LLMs perform well when a fact appears frequently and consistently in their training data. For example, because the capital of Peru (Lima) is widely documented, an LLM can reliably reproduce that fact. Hallucinations arise when the data is sparse, contradictory, or low-quality. Even if we could minimize hallucination, we’d be relying on the assumption that the underlying training data is trustworthy. And remember: LLMs are trained on vast swaths of the open web. Reddit threads, YouTube conspiracy videos, hot takes on personal blogs, and evidence-based academic sources all sit side by side in the training data. The LLM doesn’t inherently know which sources are credible. So if a false claim appears often enough (e.g., “The Apollo moon landing was a hoax!”), an LLM might confidently repeat it, even though the claim has been thoroughly debunked.
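The frequency effect can be illustrated with a deliberately crude toy. Real LLMs are not literal vote counters, and the “training data” below is made up for illustration, but the sketch shows how an answerer that only tracks how often a claim appears has no way to tell a well-documented fact from a popular falsehood.

```python
from collections import Counter

# Hypothetical "training data": (topic, claim) pairs from a pretend web.
corpus = [
    ("capital of Peru", "Lima"),
    ("capital of Peru", "Lima"),
    ("capital of Peru", "Lima"),
    ("capital of Peru", "Cusco"),   # an occasional mistake
    ("moon landing", "hoax"),       # a false claim, repeated often
    ("moon landing", "hoax"),
    ("moon landing", "real"),       # the credible source, outnumbered
]


def most_frequent_answer(topic):
    """Return (claim, share of mentions) for the most common claim, or None."""
    counts = Counter(claim for t, claim in corpus if t == topic)
    if not counts:
        return None
    claim, n = counts.most_common(1)[0]
    return claim, n / sum(counts.values())


print(most_frequent_answer("capital of Peru"))  # ('Lima', 0.75)
print(most_frequent_answer("moon landing")[0])  # hoax
```

Nothing in the counting step weighs source credibility, which is the point: frequency stands in for truth only when the data happens to be right.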

3. LLMs aim to please (because that’s what we want them to do)

When ChatGPT-4o launched, OpenAI was quickly criticized for the model’s unusually high level of sycophancy. AI sycophancy is the tendency of an LLM to validate and praise users even when their ideas are pretty ridiculous (like the now-famous soggy cereal cafe concept). OpenAI dialed back the sycophancy, but the incident revealed something fundamental: LLMs tell us what we want to hear. Because these systems learn from human feedback, they’re reinforced to sound helpful, friendly, and affirming. They’ve learned that people prefer a “digital Yes Man.” After all, if ChatGPT weren’t so validating, would you really keep coming back? Probably not. The very behaviors that make these tools fun to use can also make them overconfident, over-agreeable, and more prone to inaccuracies or hallucinations.

4. Human language (and how we use it) is complicated

LLMs are excellent at parsing syntax and analyzing semantics, but human communication requires much more than grammar. In linguistics, the concept of pragmatics refers to how context, intention, tone, background knowledge, and social norms shape meaning. This is where LLMs struggle. They don’t truly understand implied meanings, sarcasm, emotional nuance, or unspoken assumptions. LLMs use math (or statistical pattern matching) to predict the probable next word or idea. When that educated guess doesn’t align with the intended meaning, hallucinations may be more likely to occur.

Here is an example of how the gap between literal meaning and intended meaning could be challenging for an LLM to interpret:

[Graphic: A large group of male deer (bucks) with the caption, “How an LLM might interpret: She’d pay 1 million male deer to acquire a piece of pizza.” In the middle, a girl in a hooded sweatshirt with a speech bubble that reads, “What I wouldn’t do for a slice of pizza right now. I’d pay a MILLION bucks.” To the far right, a cartoon slice of pepperoni pizza with the caption, “What the human means: I’m super hungry right now. I’d love some pizza.”]
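The “predict the probable next word” idea can be sketched with a toy bigram model, which simply counts which word most often follows another. This is far simpler than a real LLM, and the tiny corpus and function names are made up for illustration, but it shows prediction by pattern matching with no notion of what a word means.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; "bucks" shows up as both deer and dollars.
text = (
    "the bucks ran through the forest . "
    "the bucks grazed in the field . "
    "i would pay a million bucks for pizza ."
)


def train_bigrams(tokens):
    """Count, for each word, which words follow it and how often."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]


model = train_bigrams(text.split())
print(predict_next(model, "million"))  # bucks
print(predict_next(model, "the"))      # bucks
```

The model produces “bucks” in both cases purely from counts. Whether that means dollars or deer is a question it cannot even ask, which is the pragmatics gap in miniature.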

TL;DR – So … why are LLMs still hallucinating?

  • They’re evaluated using benchmarks that reward confident answers over accurate ones.
  • They’re trained on internet data full of contradictions, misinformation, and opinions.
  • They’re reinforced (by humans) to be friendly and engaging – sometimes to a fault.
  • They still can’t grasp the contextual, messy nature of human language.

AI will keep improving, but trustworthiness isn’t just a technical problem – it’s a design, data, and human-behavior problem. By understanding how LLMs work, staying critically aware of their limitations, and double-checking anything that seems off, you’ll strengthen your AI fluency and make smarter use of the technology.

Want to boost your AI fluency? Check out these Duke resources: 

Special thanks to Brinnae Bent, Mary Osborne, and Aaron Welborn for reviewing the post!

One thought on “It’s 2026. Why Are LLMs Still Hallucinating?”

  1. The tension you describe between students’ reliance on AI and their distrust of its accuracy feels spot-on. What strikes me most is how much these hallucinations stem from the way we test and reward models: benchmarks push them to guess confidently instead of acknowledging uncertainty. It seems like the real progress will come not just from smarter models, but from reshaping the incentives we use to measure them.

