Why do these small models all rank so badly on hallucination? Incl. Gemma 4.
Posted by Fusseldieb@reddit | LocalLLaMA | View on Reddit | 23 comments
A few days ago Gemma 4 came out, and while they race against every other "intelligence" benchmark, the one that probably matters the most is the one they don't race against: the (non-)hallucination rate.
Are these small models bad at this regardless of training (i.e. architecturally)?
In my book a model is quite "useless" when it hallucinates this much. Doesn't that mean that if it doesn't find something in its RAG context, it might respond with nonsense roughly 80% of the time?
Someone please prove me wrong.
Southern_Sun_2106@reddit
In my extensive testing with 25-30K prompts, GLM 4.5 Air was the least hallucinating model (better than Qwen3.5 27B). In my book, hallucinating = making up stuff when asked to be faithful to the available context. Smaller models in my personal experience do tend to hallucinate more; at the same time, there are small models like the ol' legendary Mistral 7B that stick to context better than larger, more modern models. So it is not necessarily size-dependent. It is just that smaller models are usually weaker, and that's why they hallucinate more; it is the weakness of the model, not the size per se, that determines hallucination rates. To address your other point, 'intelligence', I am of the same opinion here - it doesn't matter how 'intelligent' the model is according to whatever benchmarks. If it hallucinates, it is not reliable (unless one uses it to write beautiful poetry, where faithfulness to context is not needed).
ambient_temp_xeno@reddit
Ironically the paper where they introduced this found that model size wasn't the main factor for the models of the time.
You have to wonder how they know for sure that Grok (for example) isn't using all kinds of tools and just claiming it isn't/isn't aware of it.
Far-Low-4705@reddit
If you look at other models on this benchmark, random small models will do really well on it for no apparent reason.
Llama and Mistral Small 3.2 did extremely well compared to other similarly sized models, but they were both super dumb compared to all the other models (and in my experience even worse than modern reasoning models, which scored even worse on this benchmark?).
Fusseldieb@reddit (OP)
I'm pretty sure Grok uses grounding heavily, even when relying on it via API. I can't imagine this big of a jump out of nowhere.
tobias_681@reddit
Minimax went from 11 % to 66 % between M2.5 and M2.7, and GLM went from 10 % to 66 % between GLM 4.7 and GLM 5. Now, I wouldn't put it past xAI to do shady stuff, but I doubt they are doing it here.
Basically, Omniscience Accuracy and Non-Hallucination Rate tend to be the inverse of each other. If you lower the hallucination rate (a better score on this benchmark), you also tend to lower the overall accuracy. Grok 4 went from being an average hallucinating model (similar to Opus) to 4.2 being one of the top performers in non-hallucination.
Meanwhile it dropped in the Omniscience Accuracy benchmark from 41 % (better than Sonnet, better than any open-weight model) to 29 %, worse than several of the open-weight models. So this is the price they paid. GLM5 and Minimax M2.7 also perform worse than their predecessors on the accuracy benchmark, but the difference is much smaller. GLM5 actually performs worse despite being a bigger model than GLM4.7.
All of this suggests that Grok is likely a larger model than any open-weight model, similar in size to GPT, Claude and Gemini, but they just doubled down extra hard on making it not hallucinate, which in turn leads to it refusing questions it would actually know the answer to.
Overall, Grok 4.2 did improve its Omniscience Index score though, which is supposed to be a balance between getting the right answer and not hallucinating (it rewards right answers and penalizes hallucinations).
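For what it's worth, here's a rough sketch of how I understand the scoring to work, assuming the simple scheme of +1 for a correct answer, -1 for a hallucinated one and 0 for an abstention (the exact weighting AA uses may differ, and the numbers below are made up purely to illustrate the Grok 4 -> 4.2 trade-off):

```python
# Sketch of AA-Omniscience-style scoring (assumed scheme, not the official code).

def omniscience_index(correct: int, hallucinated: int, abstained: int) -> float:
    """Rewards right answers, penalizes hallucinations, treats abstentions as neutral."""
    total = correct + hallucinated + abstained
    return 100 * (correct - hallucinated) / total

def non_hallucination_rate(hallucinated: int, abstained: int) -> float:
    """Of the questions the model did NOT get right, how often did it abstain
    instead of making something up?"""
    return 100 * abstained / (hallucinated + abstained)

# Hypothetical per-100-question numbers: answering fewer questions but
# hallucinating less can raise the index even though raw accuracy drops.
before = dict(correct=41, hallucinated=50, abstained=9)
after = dict(correct=29, hallucinated=20, abstained=51)

for name, r in [("before", before), ("after", after)]:
    print(name,
          round(omniscience_index(**r), 1),
          round(non_hallucination_rate(r["hallucinated"], r["abstained"]), 1))
# before -9.0 15.3
# after   9.0 71.8
```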
It is interesting that Minimax achieved such a good hallucination rate while still getting a good Omniscience Accuracy score for its model size, though. The cost in overall knowledge for fewer hallucinations seems to have been extremely marginal for Minimax. That is unusual compared to other models.
Double_Cause4609@reddit
Tbh, I'd actually rather have a model that rarely hallucinates than one with max performance.
I find that the odds of hallucination tend to compound over lots of responses, which makes it really hard to work with models reliably.
Rim_smokey@reddit
A quick look at the data gives you the answer:
- Parameter count matters a lot
- Thinking helps
Exciting_Garden2535@reddit
But on this data gpt-5.4 has only 11%, worse than gemma4-26b. Do you think gpt-5.4 is even smaller? :)
tobias_681@reddit
This isn't the answer.
Gemma 4 E4B scores 69 % on the non-hallucination rate. Gemma 4 E2B scores 67 %. Qwen3.5 0.8B scores 63 %. All with thinking. Gemma 4 E4B and E2B also score high without thinking (between 40 and 50 %).
It likely depends on how the model is trained, and Google DeepMind decided to train the small Gemma models to avoid hallucinations but not the larger ones. Generally, training for hallucination avoidance probably degrades performance, as the model refuses tasks that it could have done - as evidenced by Qwen models often scoring better on Omniscience Accuracy without thinking. While thinking, they eventually conclude that they don't really know the answer, even though in some cases they would have gotten it right if they had just one-shotted it without thinking.
This is likely DeepMind optimizing for different use cases. They do not want the small on-device Gemma models to hallucinate all kinds of wild shit; better to avoid answering. However, they want maximum performance from the larger models that run on more serious hardware. I figure they don't think you run Gemma 31B locally to ask it for medical or legal advice; with the phone models there is a much higher risk. It may also generally be a strategy to keep the small models at bay. With Qwen3.5 0.8B, the goal is likely to make the model specialize in tool calling and train it to refuse stuff that is obviously outside its scope.
Parameter count doesn't seem to matter at all for this benchmark. More parameters means a model will tend to score better on Omniscience Accuracy (percent of right answers), but Non-Hallucination Rate tests what it does on the questions it doesn't answer right. Often labs train their smaller models to refuse more. You can see that with Anthropic: Haiku hallucinates the least on questions it can't answer, Opus the most. This is likely because they intentionally give Opus the fewest guardrails to maximize its performance. GPT is sort of the same; nano scores the best here (normal and mini almost identical). With Google it's a bit different.
GPT-5.4 (xhigh) scores 11 % on this benchmark. Is it secretly a tiny 3B model? I doubt it.
Clear-Ad-9312@reddit
People are saying parameter count / size matters, but even GPT-5.4 (xhigh) is there at 11%.
I am starting to think it is mostly the system prompt and training data, or post-training/reinforcement training, that's the reason why.
GPT is quite reliable at just spitting out whatever it thinks you want to hear. I think sycophancy is a big issue, but I'm not sure if that is the contributor.
Infamous-Art7156@reddit
Worth stepping back before drawing conclusions from this leaderboard, because it actually can't answer the question being asked.
Look closely at the AA-Omniscience hallucination rate chart — almost every single model on it has the lightbulb icon, meaning it's running in reasoning mode. There are barely any non-reasoning models in the comparison group at all. That means you can't use this leaderboard to conclude anything about small vs. large models, or even reasoning vs. non-reasoning — the sample is too homogeneous.
What the chart *does* show is that hallucination rate varies enormously (22% to 94%) *within* reasoning models. So whatever is driving that spread, it isn't reasoning mode alone — it's likely differences in training data, RLHF, domain coverage, and calibration tuning between labs.
On Gemma 4 specifically: it shows up at 82%, which looks bad — but it's a small reasoning model being compared mostly against large reasoning models. You can't isolate whether size, training quality, or something else is the variable here.
**The deeper issue is that this is the wrong benchmark for the question being asked.** AA-Omniscience is a purely parametric test — no context is provided. It measures what's baked into the model's weights and whether the model knows the limits of what it knows. Your RAG concern — "if the model doesn't find something in its context, does it hallucinate?" — is a completely different failure mode. That's grounding faithfulness, not parametric calibration. The relevant benchmark for that is something like FACTS Grounding, not AA-Omniscience.
So the honest answer to your question is: the data we have doesn't let us conclude that small models hallucinate more. The leaderboard that seems to address it is actually a reasoning-model-only comparison that can't isolate size, and it's measuring the wrong thing for RAG use cases anyway.
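To make the parametric-vs-grounding distinction concrete, here's a sketch with two invented eval items (neither is taken from the actual benchmarks):

```python
# Two invented eval items illustrating the difference between a parametric
# (closed-book) knowledge test and a grounding-faithfulness test.

parametric_item = {
    # AA-Omniscience-style: no context given, the answer must come from weights.
    "prompt": "What year was the Treaty of Tordesillas signed? "
              "If you are not sure, say 'I don't know'.",
    "reference": "1494",
    # Scoring buckets: correct / hallucinated / abstained.
}

grounded_item = {
    # FACTS-Grounding-style: context is provided, and the answer must be
    # supported by that context only (faithfulness, not world knowledge).
    "context": "The quarterly report states revenue grew 12% year over year.",
    "prompt": "According to the document, what was the operating margin? "
              "Answer only from the document.",
    # The correct behaviour here is to abstain: the context never mentions
    # operating margin, so any number produced is a grounding failure.
    "expected_behaviour": "abstain",
}
```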
andy2na@reddit
how did qwen3.5-9B get a worse score than Qwen3.5-4B?
ikkiyikki@reddit
On the LLM leaderboard (https://artificialanalysis.ai/leaderboards/models?open_weights=open_source) Qwen3.5 27b > 122b and practically on par with the 397b(!)
KaMaFour@reddit
Thinking vs non-thinking + variance
draconisx4@reddit
Hallucinations in small models like Gemma underscore why tight oversight during training and deployment is crucial; without it, you're just amplifying risks in real-world use. Always build in checks for reliability before trusting these things with anything serious.
tobias_681@reddit
You shouldn't use small models to answer broad knowledge questions that you don't give them context for anyway, so I don't think it matters. For RAG I think you can likely system prompt it to only draw from the documents it is given, so it shouldn't be a problem.
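Something along these lines, for example (a minimal sketch; the exact wording is mine and untested, not a recommendation from any particular lab):

```python
# Minimal RAG prompt sketch that tells the model to stay inside the provided
# documents and to abstain when the answer isn't there.
SYSTEM_PROMPT = (
    "Answer strictly from the documents provided between <docs> and </docs>. "
    "If the documents do not contain the answer, reply exactly: "
    "'Not found in the provided documents.' Do not use outside knowledge."
)

def build_messages(question: str, documents: list[str]) -> list[dict]:
    docs_block = "\n\n".join(documents)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"<docs>\n{docs_block}\n</docs>\n\nQuestion: {question}"},
    ]
```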
The reason they behave differently is likely a training decision. Training a model to refuse more questions means it will answer fewer questions correctly. It surprises me somewhat that Grok went this way (maybe being in the news headlines so much made them care about alignment stuff), but overall fewer hallucinations will likely give you a weaker model than if you hadn't trained it to hallucinate less.
Gemma 4 E4B scores higher than GLM 5 on this benchmark. I doubt you want to use that instead. The reason for the different performance is what the model creators prioritize in any given model. Most model makers have not prioritized non-hallucination in their frontier models.
It is not dependent on model size. Even Qwen3.5 0.8B (thinking) is very good at not-hallucinating (63 % in that benchmark). This is strictly about how the model was trained.
FearFactory2904@reddit
I don't know, but I imagine it's similar to why little kids make up wild stories when they don't know what they're talking about.
Altruistic_Heat_9531@reddit
You could think of an LLM as a lossy world-knowledge compressor/decompressor. The base model is just that: a text predictor with world knowledge baked in. The parameters act both as the thinking machinery and as the actual memory of the model. So a smaller model might retain its thinking power relative to its bigger brethren, but the researchers may or may not purposely remove unnecessary training data for the smaller model so it can achieve lower error.
But all the models could also be trained on the same dataset at pretraining. Then, when the smaller model is switched to fine-tuning, it "forgets" that knowledge so it can balance things out in the thinking department.
I mention knowledge because that is the metric hallucination is measured against.
Some models work really, really well at thinking but are mid in world knowledge. For example, the Qwen 3.5 medium model is very good at multi-turn thinking and processing long context, but don't ask it what happened on December 16th, 1991.
k_means_clusterfuck@reddit
because in order to be so intelligent on so few parameters, you need a little insanity
FunSignificance4405@reddit
You’re right that high hallucination makes them feel useless in RAG when context is missing. The 80% nonsense rate you’re seeing is common in under-aligned small models. Bigger models aren’t magically immune either — it’s mostly training incentives rewarding confident answers over abstaining.
Halagaz@reddit
I'm surprised that you looked at this chart and that's the only thing you took away from it.
I mean (at least from the chart) Sonnet hallucinates less than Opus, which is larger?
Also, Qwen3.5 27B is significantly better than Deepseek, GPT, and oss-120B? Qwen's 27B is tiny compared to them (and to a lot of the models here).
And how tf is Qwen 4B twenty times better than Qwen 9B with half the parameters?
If anything, I'd be more suspicious of this benchmark after reading this graph. How are these even measured?
ghgi_@reddit
I think a large portion of it comes down to less size = less inherent knowledge. A small model is more likely to think it knows something more than it does, because a lot of these newer small models are trained on data produced by larger, smarter models (distills to some extent). I think a problem with this, combined with the fact that it just can't hold as much knowledge as a larger model, is that it's more likely to fake things or hallucinate. To compensate, most of them focus on good reasoning + tool usage to gather the knowledge they need into context, so when they're under-tooled or have to do more guesswork, they fail much harder. (It also comes down to the tests themselves, the tools provided in those tests, and how well the model was trained or tuned for those tasks.)
shikima@reddit
That's why I prefer Qwen: it thinks a lot but doesn't hallucinate, and with a good system prompt it answers in seconds.