Long-context performance at lower quants

Posted by _TheWolfOfWalmart_@reddit | LocalLLaMA | 3 comments

I've been using Qwen3.5 122B A10B (Q3_K_XL) a lot lately for coding, and it's been pretty incredible overall. It feels not far off from frontier-level for most tasks. But I've noticed that once I hit around 75-80k tokens of context, it usually starts to get dumb all of a sudden.

It just hits a brick wall, and quality deteriorates rapidly and drastically. It'll begin hallucinating, forgetting things, or thinking that something it said or suggested was actually something I said.

I've found I have to compact the context before I get to that point, and then it keeps going just fine.
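
For anyone unsure what compacting before the drop-off looks like, here is a rough Python sketch of the idea. Everything in it is an assumption for illustration: the 70k threshold (picked to sit a bit under the 75-80k mark above), the ~4-characters-per-token estimate, and the `summarize` callback, which stands in for whatever actually produces the summary (usually another call to the model). It is not how any particular client implements compaction.

```python
from typing import Callable, List

COMPACT_THRESHOLD = 70_000  # assumed: a bit below the ~75-80k point where quality drops
KEEP_RECENT = 4             # assumed: number of most recent messages kept verbatim


def estimate_tokens(text: str) -> int:
    """Crude estimate (~4 characters per token); a real tokenizer will differ."""
    return max(1, len(text) // 4)


def total_tokens(messages: List[str]) -> int:
    """Estimated token count of the whole conversation."""
    return sum(estimate_tokens(m) for m in messages)


def maybe_compact(
    messages: List[str],
    summarize: Callable[[List[str]], str],
    threshold: int = COMPACT_THRESHOLD,
    keep_recent: int = KEEP_RECENT,
) -> List[str]:
    """Replace older messages with one summary once the estimate reaches the threshold."""
    if total_tokens(messages) < threshold or len(messages) <= keep_recent:
        return messages  # still under budget, or nothing old enough to fold away
    split = len(messages) - keep_recent
    old, recent = messages[:split], messages[split:]
    # `summarize` is a hypothetical callback supplied by the caller.
    return [summarize(old)] + recent
```

Calling `maybe_compact(history, summarize)` before each new request keeps the conversation under the threshold, at the cost of losing whatever detail the summary drops.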

Is this because I'm running Q3? Unfortunately Q4 is just outside of the capability of my system specs unless I want to start disk swapping.

So is it just an issue with this particular model? Or because it's Q3? Are there llama.cpp settings that can help?

I'm already using BF16 KV cache.

Reddich07@reddit

Serveurperso@reddit

Blues520@reddit