Is there any <3B model with usable 200k+ context window?
Posted by madmax_br5@reddit | LocalLLaMA | 15 comments
I need a small model for processing conversation transcripts from larger models, so I need a usable context window out to at least 200k tokens. I know some models claim to support this, but I don’t know which are actually good at it in practice.
Also desirable: low hallucination rate, not super verbose.
blastbottles@reddit
Gemma 4 E2B
AmoebaDue6638@reddit
Qwen3 0.6B and Gemma 3 1B both claim 128k+, but quality falls off a cliff past ~32k at that size. For 200k you realistically need to go the API route. Orthogonal bundles access to Gemini, Claude, and others behind one key if you want to skip managing multiple providers.
madmax_br5@reddit (OP)
Unfortunately this is a mech interp use case, so I’ve got to have access to the model guts and the token probabilities.
rpiguy9907@reddit
Qwen 3.5 - 2B is the only game in town that I know of with 200K+ context. But if memory is limiting you to a 2B model, do you even have room for 200K+ context? That is the real question.
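To put rough numbers on the memory question, here is a back-of-the-envelope sketch of KV-cache size for a plain full-attention transformer. The layer count, KV-head count, and head dimension are hypothetical placeholders, not the published configuration of any model named in this thread, and architectures that use sliding-window or linear-attention layers would come out smaller.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV-cache size in bytes for one sequence with full attention.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes a 16-bit cache (fp16/bf16).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical small-model configuration (illustrative values only).
size = kv_cache_bytes(n_layers=28, n_kv_heads=8, head_dim=128, seq_len=200_000)
print(f"{size / 2**30:.1f} GiB of KV cache for one 200k-token sequence")
```

Under these assumed values the cache alone is far larger than the weights of a 2B model, which is why the context length, not the parameter count, tends to set the memory budget.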
madmax_br5@reddit (OP)
It’s not a memory issue; it’s a speed and cost at scale issue.
coder543@reddit
Qwen3.6 35B A3B only uses 3 billion parameters per token, so it runs at the speed of a 3B model while being tremendously more intelligent than any 3B model.
If you have the memory available, that is a much better choice when you need to prioritize speed and cost at scale.
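A rough way to see the trade-off is the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token, while weight memory scales with total parameters. The sketch below applies that rule only; it ignores attention cost (which grows with context length), expert-routing overhead, and memory bandwidth, so treat the numbers as an approximation rather than a benchmark.

```python
def forward_flops_per_token(active_params):
    """Rule-of-thumb matmul cost: ~2 FLOPs per active parameter per token."""
    return 2 * active_params


def weight_bytes(total_params, bytes_per_param=2):
    """Memory needed just to hold the weights (16-bit by default)."""
    return total_params * bytes_per_param


# Parameter counts taken from the model names in the comment above.
dense_3b = {"total": 3_000_000_000, "active": 3_000_000_000}
moe_35b_a3b = {"total": 35_000_000_000, "active": 3_000_000_000}

for name, cfg in [("dense 3B", dense_3b), ("MoE 35B A3B", moe_35b_a3b)]:
    gflops = forward_flops_per_token(cfg["active"]) / 1e9
    gib = weight_bytes(cfg["total"]) / 2**30
    print(f"{name}: ~{gflops:.0f} GFLOPs/token, ~{gib:.1f} GiB of weights")
```

Both configurations land on the same per-token compute estimate, but the MoE needs room for all 35B parameters.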
madmax_br5@reddit (OP)
Are there any differences in prefill for MoE vs dense? My application needs fast prefill.
coder543@reddit
Maybe half the speed for 35BA3B vs 4B under llama.cpp, but if prefill performance is that important, you can often see a significant jump by dealing with the pain in the butt that is vLLM.
A factor of two is well worth a huge jump in ability. It doesn’t matter that you can do the task quickly if it is being done wrong.
madmax_br5@reddit (OP)
I’m actually not generating any tokens at all. I’m doing token probability analysis on transcript replays for hallucination detection, kind of like reverse speculative decoding. In my testing, even 1B models were good enough for useful data, but 3B is where models start offering useful context windows.
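A minimal sketch of what this kind of replay scoring might look like, assuming you already have per-position logits from a single prefill pass over the transcript (from whatever framework exposes them). The function names and the `threshold` value are hypothetical, and this is one plausible reading of the approach, not the OP's actual pipeline.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def replay_logprobs(logits_per_position, token_ids):
    """Log-probability the model assigned to each token actually in the transcript.

    logits_per_position[i] is the model's prediction for token_ids[i + 1],
    made after reading token_ids[: i + 1]. The first token gets no score.
    """
    scores = []
    for i in range(len(token_ids) - 1):
        row = log_softmax(logits_per_position[i])
        scores.append(row[token_ids[i + 1]])
    return scores


def flag_surprising(scores, threshold=-4.0):
    """Indices into token_ids whose log-prob is below an (arbitrary) threshold."""
    return [i + 1 for i, s in enumerate(scores) if s < threshold]


# Toy example: 3-token vocabulary, transcript tokens [0, 1, 2].
toy_logits = [[0.0, 5.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
scores = replay_logprobs(toy_logits, [0, 1, 2])
print(flag_surprising(scores))
```

In the toy example the model expected token 1 after token 0 but did not expect token 2 after token 1, so only the last token is flagged.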
sisyphus-cycle@reddit
Prefill will be slower for MoE if you offload it vs a fully VRAM-loaded dense model (typically, but not always).
Enough-Astronaut9278@reddit
Sub-3B with 200k usable context is basically unicorn territory. Your best bet is chunking transcripts through Qwen2.5-3B in passes rather than trying to fit it all in one shot.
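One way to chunk for probability scoring is overlapping windows: each window re-reads some tokens as left context, and only the tokens past that overlap are scored, so every token is scored exactly once. The `window` and `overlap` parameters below are hypothetical, and tokens scored this way see less context than in a single full pass, which may shift their probabilities.

```python
def overlapping_windows(n_tokens, window, overlap):
    """Yield (start, end, score_from) spans covering range(n_tokens).

    Tokens in [start, score_from) are context only; tokens in
    [score_from, end) are the ones to score for this window.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    start = 0
    while True:
        end = min(start + window, n_tokens)
        score_from = start if start == 0 else start + overlap
        yield start, end, score_from
        if end >= n_tokens:
            break
        # Step back so the next window re-reads `overlap` tokens as context.
        start = end - overlap


# Tiny illustration: 10 tokens, windows of 4 with 2 tokens of overlap.
spans = list(overlapping_windows(10, window=4, overlap=2))
print(spans)
```

A larger overlap gives each scored token more context at the cost of more repeated prefill work.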
PhoneOk7721@reddit
Qwen2.5 spotted. This is a bot, report them.
madmax_br5@reddit (OP)
Not sure if chunking is possible for this use case, but I’ll try, thanks!
HVACcontrolsGuru@reddit
Try looking at the IBM Granite models? A 4B or 8B parameter model should handle that type of task. I don’t think they have a context window that big, though.
dataexception@reddit
Is your limitation VRAM or system DRAM? Or is it a combination of both? If you can describe your architecture a little bit, that would help get an idea of the capabilities you have available.