Long-context performance at lower quants
Posted by _TheWolfOfWalmart_@reddit | LocalLLaMA | 3 comments
I've been using Qwen3.5 122B A10B (Q3_K_XL) a lot lately for coding, and it's been pretty incredible overall. It feels not far off from frontier-level for most tasks. But I've noticed that once I hit around 75-80k tokens of context, it usually gets dumb all of a sudden.
It hits a brick wall and quality deteriorates rapidly and drastically. It starts hallucinating, forgetting things, or attributing things it said or suggested to me.
I found I have to compact before I get to that point, and then it keeps going on just fine.
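For anyone who wants to automate the "compact before the wall" habit, here is a minimal sketch of a threshold check. The function names, the 75k default (taken from the low end of the range above), and the 85% safety margin are all assumptions for illustration, not anything built into llama.cpp or a coding agent.

```python
# Hypothetical helper: decide when to compact, based on where quality
# was observed to drop (around 75-80k tokens in this post).

def compaction_threshold(degrade_at: int = 75_000, margin_pct: int = 85) -> int:
    """Token count at which to compact: a safety fraction of the
    observed degradation point. Integer math avoids float rounding."""
    return degrade_at * margin_pct // 100


def should_compact(used_tokens: int, degrade_at: int = 75_000,
                   margin_pct: int = 85) -> bool:
    """True once context use has reached the compaction threshold."""
    return used_tokens >= compaction_threshold(degrade_at, margin_pct)


if __name__ == "__main__":
    for used in (40_000, 60_000, 70_000):
        print(used, should_compact(used))
```

The margin exists so that compaction itself (which also consumes context) finishes before the model reaches the range where it starts misattributing things.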
Is this because I'm running Q3? Unfortunately Q4 is just outside of the capability of my system specs unless I want to start disk swapping.
So is it just an issue with this particular model? Or because it's Q3? Are there llama.cpp settings that can help?
I'm already using BF16 KV cache.
Reddich07@reddit
This is quite normal and happens even with frontier models at around 100,000 tokens, because the LLM gets confused by too much context. It's sometimes called the dumb zone: https://github.com/mattpocock/dictionary-of-ai-coding#smart-zone
Serveurperso@reddit
Try unsloth's Q4_K_XL with the KV cache at Q8. That's for 96GB of VRAM.
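To see roughly how much memory a Q8 KV cache frees up compared with BF16, here is a back-of-the-envelope sketch. The layer count, KV head count, and head dimension below are placeholder values, not the real Qwen3.5 122B A10B config (read those from the model's GGUF metadata), and the formula assumes every layer keeps a standard K/V cache, which may not hold for hybrid architectures. The cache type names match the values accepted by llama.cpp's `--cache-type-k` / `--cache-type-v` options.

```python
# Rough KV cache size estimate. Architecture numbers are hypothetical.

BYTES_PER_ELEM = {
    "f16": 2.0,
    "bf16": 2.0,
    # q8_0 stores blocks of 32 int8 values plus one 16-bit scale: 34 bytes / 32 elems
    "q8_0": 34 / 32,
}


def kv_cache_gib(n_ctx: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, cache_type: str) -> float:
    """Estimated KV cache size in GiB for one sequence of n_ctx tokens."""
    # Factor of 2: one K tensor and one V tensor per layer.
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * BYTES_PER_ELEM[cache_type] / 2**30


if __name__ == "__main__":
    # Placeholder architecture values (assumption, not the real model config).
    cfg = dict(n_ctx=131_072, n_layers=48, n_kv_heads=8, head_dim=128)
    for cache_type in ("bf16", "q8_0"):
        print(cache_type, round(kv_cache_gib(cache_type=cache_type, **cfg), 2), "GiB")
```

With these placeholder numbers the cache drops from 24 GiB to 12.75 GiB, so the Q8 cache is a bit over half the size of BF16. Whether that is enough to fit Q4_K_XL depends on the real model dimensions and the context length you actually run.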
Blues520@reddit
What hardware are you running it on, and is it really better than Qwen 3.6 27B?