Is there a place to search/submit an error/hallucination I saw running local LLM?
Posted by alex20_202020@reddit | LocalLLaMA | View on Reddit | 10 comments
I'm tagging this 'Discussion' because, if there is no such commonly known place (and I was not able to find one via web search), I propose creating one.
I have just unexpectedly observed that Crow Qwen 3.5 9B Q5 hallucinated how to change the wiki page for a git project, whereas Gemma 4 26B seems to have given the correct answer.
For the sake of evaluating local models, I want to see a list of mistakes models have made, and I'm ready to contribute my observations.
dark-light92@reddit
AI models are not software in the sense that you can track issues and bugs.
What you call a hallucination is just the normal working of the LLM. It's not an error. It predicted the most probable next token, just like it always does.
A place to track hallucinations would not be useful, because if you run the same prompt again against the same model, you might get a different result. If you change the temperature or sampling settings, the responses will change again.
abitrolly@reddit
Temperature 0 responses should not be affected by randomness, right?
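For context, "temperature 0" is usually implemented as greedy (argmax) decoding, which removes the sampling randomness entirely. A minimal sketch, with hypothetical next-token logits:

```python
def greedy_pick(logits):
    # Temperature 0 degenerates to argmax over next-token scores:
    # no random sampling is involved.
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [1.2, 3.4, 0.7]  # hypothetical next-token logits
# Identical logits always yield the same token index.
print(greedy_pick(logits))  # -> 1
```

So given bit-identical logits, the output is deterministic; the catch, as the reply below notes, is that the logits themselves may differ across hardware.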
dark-light92@reddit
It can reduce it, but LLM architecture, subtle differences in floating-point math on different GPU hardware, the order of math operations during inference, etc., can also influence how tokens are predicted. Truly deterministic LLM inference is difficult to achieve.
It's also debatable whether you should even want it, since human language itself is not deterministic.
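The order-of-operations point can be shown in a few lines: floating-point addition is not associative, so reordering the same reduction (as different GPUs or kernel schedules do) can change the result, which can flip a near-tied argmax.

```python
# Floating-point addition is not associative: summing the same
# numbers in a different order can give a different result.
a = (0.1 + 1e16) - 1e16   # 0.1 is absorbed into 1e16 and lost
b = 0.1 + (1e16 - 1e16)   # the large terms cancel first, 0.1 survives
print(a, b)  # -> 0.0 0.1
assert a != b
```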
alex20_202020@reddit (OP)
But benchmarks try to. Through many repetitions, I guess, and just luck? I haven't read exactly how benchmarks are run. My idea is that if enough data is collected from ordinary use, it might average out.
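A toy sketch of that averaging idea (all numbers hypothetical): if each run of a stochastic model passes with some fixed probability, the pass rate over many repetitions converges toward that probability.

```python
import random

def pass_rate(p_correct, runs, rng):
    # Simulate repeated benchmark runs of a stochastic model:
    # each run independently passes with probability p_correct.
    return sum(rng.random() < p_correct for _ in range(runs)) / runs

rng = random.Random(0)  # fixed seed for reproducibility
print(pass_rate(0.7, 10_000, rng))  # settles near 0.7
```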
alex20_202020@reddit (OP)
Well, yes, thanks.
Song-Historical@reddit
Unless they have access to your stack, there's no real way for them to determine how or why the hallucination occurred.
abitrolly@reddit
But I think Anthropic had tools to visualize what happens inside. At least their papers were full of beautiful visualizations.
Song-Historical@reddit
Yes, that's what I'm saying. Anthropic owns the stack, and you will never own their stack, because they have a vested interest in not letting you treat them like an inference utility. They will never tell you how they price out or price in hallucinations versus good inference, because you could make direct comparisons with other providers and go with whatever suits your workflow.
abitrolly@reddit
Unless you are willing to curate and maintain it yourself first, it is not going to happen. But the thought that an LLM can learn from mistakes is rational. You just need proof that it works.
SM8085@reddit
As in, working with a specific tool? Or did it hallucinate the wiki format?