🚀 Qwen3-30B-A3B-2507 and Qwen3-235B-A22B-2507 now support ultra-long context—up to 1 million tokens!

QuackerEnte@reddit

When Qwen3.5-VL-Omni-Audio-30B-A3B-1M-GGUF:Q4_K_M? 😔

Green-Ad-3964@reddit

Is it possible to get 1 million tokens from a single prompt? Maybe this is a naive question, but I find it extremely difficult to get more than a thousand words for each answer, generally speaking.

koflerdavid@reddit

It's for use cases where you feed an entire source code repository (maybe even with version control history) to the model, or entire books.

[-]

I mean, you could probably force it using certain hyperparameters. But the context window here is more about being able to have long-context conversations, not about outputting X amount of tokens in a single message.

madaradess007@reddit

wtf guys, where is 8b size? i cant buy a gpu without a job that i lost to vibe-coders

SandboChang@reddit

Maybe a naive question, if I am using 128-256k token context windows anyway, should I still use this or stick with the original 2507?

LinkSea8324@reddit

Either way, DCA **NEEDS** vLLM, which means you can't use llama.cpp, you can't use the V1 engine, and you're stuck with eager mode. So no, don't bother trying to use it.

intellidumb@reddit

Has anyone gotten it to run with vLLM with DCA enabled to get 1 million context window? We keep hitting issues with the config for DCA even though we followed the model card instructions directly. Would love to hear any tips from someone that got it to work!

SandboChang@reddit

I do run vLLM on the V0 engine for maybe a 20% loss in performance, in exchange for being able to use FP8 quant for the KV cache. It is not meaningless but it’s a trade-off, one that I already have, so I guess I should find out.

kapitanfind-us@reddit

Apologies, newbie here, what does the FP8 get you in exchange for the performance loss? How much VRAM do you have?

SandboChang@reddit

No need to apologize, it’s not necessarily obvious. Essentially you need VRAM not just for the weights but also for the KV cache used during inference. The larger the context window you want to assign, the more VRAM you need on top of the weights. When serving with a large window like 128k/256k, the cache can actually get to tens of GB. Being able to quantize it down to a lower but still acceptable precision like FP8 thus allows one to serve either a larger context window or higher concurrency (simultaneous inference over a large amount of context) with the same context window size. How valuable these are depends on how many users you expect to serve at the same time.
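To put rough numbers on that, here is a back-of-the-envelope sketch of KV cache size for a single sequence. The layer count, KV head count, and head dimension below are assumed, illustrative values for a 30B-A3B-class model (check the model's `config.json` for the real ones), and `kv_cache_bytes` is a hypothetical helper, not part of vLLM.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value):
    """KV cache size for one sequence: a K and a V vector per layer, per KV head, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


# Assumed, illustrative model shape (not taken from this thread).
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128
GIB = 1024 ** 3

for ctx in (128 * 1024, 256 * 1024):
    fp16 = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM, 2)  # 2 bytes per value
    fp8 = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM, 1)   # 1 byte per value
    print(f"{ctx:>7} tokens: fp16 {fp16 / GIB:.1f} GiB, fp8 {fp8 / GIB:.1f} GiB")
```

Under these assumptions the cache is about 12 GiB at 128k and 24 GiB at 256k in fp16, and half that in FP8. That is per sequence, so concurrent requests multiply it.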

kapitanfind-us@reddit

Makes a lot of sense thanks - didn't even know vllm was capable of that. On my 3090 I can only run AWQ but I was trying to run this Qwen3-235B-A22B-2507 and couldn't - if I understand correctly quantizing the kv cache could get me to run that one here. Correct?

SandboChang@reddit

235B is way too large for a single GPU; running it at 4-bit takes at least 120 GB of VRAM. vLLM is GPU only, so you will need something else like llama.cpp to be able to split between VRAM and host RAM. I am not familiar with that, but there are many people doing that kind of split. Catch is it’s gonna be slow due to the bandwidth of host RAM. If I were you I would just stick to whatever models fit. You can try Qwen3 30B-A3B or gpt-oss 20B; these new medium-size models perform well and fit well in a 3090.

kapitanfind-us@reddit

Yeah what I meant is not even the 30B-A3B fits (barely)

phazei@reddit

I also have a 3090, I can run 30B-A3B just fine at Q4_K_M, it's only 16gb, and LM Studio supports quantized KV cache, so I have ok context lengths, not huge though.

kapitanfind-us@reddit

Yes you are right, but I found the Q5_K_XL is way more accurate here.

phazei@reddit

Good to know, thanks

phazei@reddit

LM Studio, and thus I think llama.cpp, supports Q8 KV cache. Is that going to perform differently than FP8? Also, I noticed some models start repeating and performing poorly with Q8 KV cache as well. Have any experience with that?

SandboChang@reddit

I can’t tell, but I think Q8 should also give acceptable performance; at least that's what I use on my 5090 with Qwen3 Coder 30B Q4 to push the context window size. Usually the repeating issue comes when you go over the context window size and the model loses the original context and starts to loop indefinitely.

mister2d@reddit

I scratch my head as to why quantized kv cache on the V1 engine doesn't have a higher priority.

vibjelo@reddit

I haven't tried it myself, but even when 2507 "supports" 128k context length, it doesn't mean you'll get the same quality of responses across that whole context, it usually degrades kind of quickly, so asking the same question in the beginning of the context and in the end, will lead to wildly different quality responses. I'm guessing both DCA and MInference might help with not only "the context length it has on the box" (the advertised context length) but also with the more important "actually usable context", which is helpful regardless of context length (except really short ones obviously). I haven't tried out these new weights myself, so don't quote me on this, but intuitively it would make sense that it's an overall improvement on useful context, not just length.
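One way to check the "same question at the beginning versus the end of the context" effect yourself is a simple needle-in-a-haystack probe. This is only a sketch: `build_probe`, the filler text, and the needle are made up for illustration, and you would still need to send each prompt to your own model and compare the answers.

```python
def build_probe(filler_sentences, needle, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


# Hypothetical filler and needle; scale the filler up to reach the context length under test.
filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The secret code is 7421."
question = "What is the secret code?"

# One prompt per depth; each ends with the same question.
prompts = {
    depth: build_probe(filler, needle, depth) + "\n\n" + question
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
}
```

If answers are right at depth 0.0 and 1.0 but wrong in the middle, the advertised context is longer than the usable one for that setup.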

das_war_ein_Befehl@reddit

The usable length for all the models is pretty much the same regardless of their actual context window. Performance degrades after like 40-60k tokens

DorphinPack@reddit

For speed, this is measurable but hardware dependent. For quality this will be context dependent. I think. Training on quality data that actually uses that much context is part of it but if CoT can affect output just by populating the context with more detail then certain long contexts will be more coherent than others.

SandboChang@reddit

Yeah I am more interested in the quality of small context too, the max I can and will do is 128k anyway. Guess I will wait for some benchmark.

Divergence1900@reddit

“Together, these innovations significantly improve both generation quality and inference efficiency for sequences beyond 256K tokens.” I would expect similar performance unless you’re filling up your context window often.

HilLiedTroopsDied@reddit

Shouldn't I be sending the utf8 content of every file in my code repo in every message when asking for changes!? :P

hainesk@reddit

Not sure why you got downvoted lol. Your comment was clearly a joke..

HilLiedTroopsDied@reddit

It's just UTF8 haters who don't get sarcasm

Sad_Cardiologist_835@reddit

Has anyone benchmarked this? Looking forward to shifting our prod workload from Flash to Qwen.

evilbarron2@reddit

FYI - https://youtu.be/TUjQuC4ugak?si=zXEAGyhpa5JA8XyT

Voxandr@reddit

Why downvotes? The max usable context for most models is just around 15-20k, best around 10k. After that it all goes to dirt.

johnabbe@reddit

My first question of a friend who seemed to have some expertise with LLMs was whether they had a limited lifetime. I was briefly excited when he said no limit, then disappointed later to realize he had misunderstood the question. A million tokens sounds big, but not when you consider how many token equivalents a living being might use in a day, or a lifetime. It's starting to look like LLMs just don't scale well that way, one of several challenges limiting the technology. If anyone knows of major breakthroughs or potential for such in this area, please share!

AbyssianOne@reddit

The best you can do right now is use a rolling context window. You can have the AI refresh important information into its messages to put it back at the most recent portion of the context window. You can also integrate a local database and allow the AI to use it to save information and memories so it can recall them later as desired. You could also integrate something like Letta, which lets the AI be in direct control of archival database memory as well as "Core Memory" blocks that the AI can enter information into, permanently retaining the things it finds important in the context window.
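A minimal sketch of the rolling-window idea, assuming a crude word-count stand-in for a real tokenizer. `RollingContext` and `count_tokens` are hypothetical names for illustration; this is not how Letta implements its memory.

```python
from collections import deque


def count_tokens(text):
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


class RollingContext:
    """Keeps pinned "core" notes forever and drops the oldest messages first."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.pinned = []       # always stays in context
        self.recent = deque()  # oldest message on the left

    def pin(self, text):
        self.pinned.append(text)
        self._trim()

    def add(self, text):
        self.recent.append(text)
        self._trim()

    def _used(self):
        return sum(count_tokens(t) for t in self.pinned) + sum(
            count_tokens(t) for t in self.recent
        )

    def _trim(self):
        # Evict the oldest unpinned messages until the budget fits again.
        while self.recent and self._used() > self.max_tokens:
            self.recent.popleft()

    def render(self):
        return self.pinned + list(self.recent)


ctx = RollingContext(max_tokens=12)
ctx.pin("User's name is Sam.")
ctx.add("one two three four")
ctx.add("five six seven eight")
ctx.add("nine ten")  # pushes the total over budget, so the oldest message is dropped
print(ctx.render())
```

The pinned note survives while the oldest message falls out. A database-backed setup would write evicted messages somewhere searchable instead of discarding them.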

johnabbe@reddit

Any data stored outside of the context is (obviously) not available to the LLM, and managing when to bring which parts of it in is a complex, high art. The fact that there is *so* much energy being put into these non-LLM supporting technologies gives the very strong impression that developers have zero expectation for LLM context windows to grow quickly.

AbyssianOne@reddit

You can just let the AI take care of it. look into Letta. The AI can choose what to save as archival memories in the database, what to save as core memories that are always in context, and when to search to retrieve data.

johnabbe@reddit

I'm sure they do the best they can, but none of it solves the basic problem.

One-Employment3759@reddit

Yeah, this is the thing I'm also interested in. Context is kind of a replacement for having working memory. And LLM weights are otherwise static after training. I can see a lot of reasons for doing this. I mean, who wants an LLM that actually learns and bleeds context between conversations and customers? That would be bad. Tokenization and latent embedding also make it almost impossible to get verbatim quotes from documents, or to correctly count letters in words. Having a byte-level or binary working memory for storage could help with exactness. Of course, I'm not sure right now how you'd frame that in a trainable/scalable way.

lucasruedaok@reddit

What about coding and tooling?

superkickstart@reddit

That's like 50k loc?

gnorrisan@reddit

I'd like to see some prompts where the extended context actually improves the response.

AbyssianOne@reddit

How many VRAMs do I need to do this? Does it affect model capability If you quantize the KV cache to one bit? :p

MrWeirdoFace@reddit

Question. As someone with only a 3090 (24GB) + (64GB DDR4 3200) is a high context like that even usable for me? I'm asking because I haven't bothered to try over 32k locally on lmstudio, and it seems like most models I've used despite declaring higher context seem to start losing their focus about halfway there.

cristoper@reddit

No. This feature is maybe something large providers will offer, but even if you quantize both the weights and the kv-cache to 4-bits I think you'd still need around 80GB VRAM to run the 30b model at 1 million tokens.

Bakoro@reddit

On the bright side, that sounds like the AI-specific computers coming out will be well positioned to take advantage. Nvidia's thing is all about 4-bit quants.

MrWeirdoFace@reddit

Right to the point. Much appreciated.

JLeonsarmiento@reddit

Ok, I’m officially out of ram.

silenceimpaired@reddit

Why has 30b been abandoned :/

waszumteufel@reddit

Any ideas if the MLX version has support or will have support in the near future? MLX currently runs the original 30b a3b 2507 at ~262k context no problem. I'm assuming a change would have to be made to the qwen3 model definition in the mlx-lm repo or something but idk if there is something in this innovation that precludes easy mlx support.

Current-Rabbit-620@reddit

How much extra memory for 1m context

cristoper@reddit

For the 235b model:

> To effectively process a 1 million token context, users will require approximately **1000 GB** of total GPU memory.

And the 30b model:

> To effectively process a 1 million token context, users will require approximately **240 GB** of total GPU memory

Unquantized, but still.

Current-Rabbit-620@reddit

Most people don't believe it needs that much

101m4n@reddit

I have 192G of vram and with qwen3 235B I can only just fit the base 256k context. So, uhh, fucking truckloads I guess.

ChainOfThot@reddit

+1, can barely get 20k context with my 5090

Kitchen-Year-8434@reddit

Consider quantizing key cache to q8_0 and v cache to q5_1 to save VRAM if you're not already. Lots of people with lots of opinions there, but the perplexity #'s tell a clear story. Alternatively, consider exllamav3 w/the kv cache at 4,4 since it doesn't lose accuracy in the same way other kv cache implementations do.
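For a rough sense of the savings, here is a sketch of the storage cost per cached value, assuming ggml's block layouts as I understand them: blocks of 32 values, q8_0 with one fp16 scale, and q5_1 with an fp16 scale plus an fp16 minimum. It also assumes the K and V caches are the same size. Treat the byte counts as assumptions to verify against the ggml source.

```python
BLOCK = 32  # values per quantization block (assumed ggml block size)


def bits_per_value(block_bytes, block_size=BLOCK):
    """Average storage cost per cached value, including per-block metadata."""
    return block_bytes * 8 / block_size


F16 = 16.0
Q8_0 = bits_per_value(2 + 32)          # fp16 scale + 32 int8 values = 34 bytes
Q5_1 = bits_per_value(2 + 2 + 4 + 16)  # fp16 scale + fp16 min + 32 high bits + 32 low nibbles = 24 bytes

# K stored as q8_0 and V as q5_1, with K and V assumed equal in size.
mixed = (Q8_0 + Q5_1) / 2

print(f"q8_0: {Q8_0} bits/value, q5_1: {Q5_1} bits/value")
print(f"K=q8_0 + V=q5_1 uses {mixed / F16:.1%} of the f16 KV cache memory")
```

Under those assumptions the mixed setup needs a bit under half the memory of an f16 cache, which roughly doubles the context that fits in the same VRAM.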

ayylmaonade@reddit

Really? What quant? With the unsloth UD-Q4_K_XL quant on my 7900 XTX 24GB I'm able to get pretty high context windows. I usually stick to 38K as I rarely need more, but I can go to 64K with no problems.

renrutal@reddit

Where are the /r/PoorLocalLlama models 🥲

Far_Buyer_7281@reddit

is this different from the 1m versions of unsloth?

LinkSea8324@reddit

Either way, DCA is not implemented in llama.cpp, so you won't benefit from the speed boost of DCA.

vibjelo@reddit

> Either way DCA is not implemented in llama.cpp so you won't benefit the speed boost of DCA

Is DCA supposed to be a performance improvement? Reading the abstract of the paper (https://arxiv.org/pdf/2402.17463) it seems to be about making more of the context useful and usable, not that inference would be faster.

LinkSea8324@reddit

You're probably right. Here they use sparse attention and DCA; from my understanding they use the two at the same time.

DistanceSolar1449@reddit

That’s YaRN

PermanentLiminality@reddit

It might support a 1 million context, but my VRAM will not.

wooden-guy@reddit

GIVE ME QWEN 3 8B INSTRUCT AND REASONING ALREADY GODDAMN.

ThinkExtension2328@reddit

The 4b is as good as the old 8b models, try that.

wooden-guy@reddit

Yeah I know, that's why I want a 8B, cause it'll be as good as the old 14B

Own-Potential-2308@reddit

What old models? Llama 3.1 8b?

BoJackHorseMan53@reddit

That's more than gpt-5 context length. Someone show it to the Saltman fanboys in r/accelerate

Valhall22@reddit

Impressive

z1xto@reddit

based on my testing, it performs really poorly with longer context compared to something like gemini

-p-e-w-@reddit

https://arxiv.org/pdf/2402.17463 This is the paper for Dual Chunk Attention. It’s quite easy to read and well-structured.

Chromix_@reddit

Here is the [previously created](https://www.reddit.com/r/LocalLLaMA/comments/1mkq4i4/qwen_added_1m_support_for_qwen330ba3binstruct2507/) thread for this.
