Qwen Introduced FlashQLA
Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 59 comments
Introducing FlashQLA: high-performance linear attention kernels built on TileLang.
2–3× forward speedup. 2× backward speedup.
💻 Purpose-built for agentic AI on your personal devices.
Key insights:
- Gate-driven automatic intra-card CP.
- Hardware-friendly algebraic reformulation.
- TileLang fused warp-specialized kernels.
FlashQLA boosts SM utilization via automatic intra-device CP. The gains are especially pronounced for TP setups, small models, and long-context workloads.
Instead of fusing the entire GDN flow into a single kernel, we split it into two kernels optimized for CP and backward efficiency. At large batch sizes this incurs extra memory I/O overhead vs. a fully fused approach, but it delivers better real-world performance on edge devices and long-context workloads.
The backward pass was the hardest part: we built a 16-stage warp-specialized pipeline under extremely tight on-chip memory constraints, ultimately achieving 2×+ kernel-level speedups.
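For readers new to linear attention, here is a minimal plain-PyTorch sketch of the kind of gated recurrence these layers compute. It is a simplified illustration only, not the FlashQLA kernels or the full GDN update; the function name and shapes are made up for the example.

```python
import torch

def gated_linear_attention_reference(q, k, v, g):
    """Simplified sequential reference (illustration only, not FlashQLA/GDN).
    Recurrence:
        S_t = g_t * S_{t-1} + k_t^T v_t
        o_t = q_t S_t
    Shapes: q, k: [T, d_k]; v: [T, d_v]; g: [T] per-step decay gates in (0, 1).
    """
    T, d_k = q.shape
    d_v = v.shape[-1]
    S = q.new_zeros(d_k, d_v)                   # running d_k x d_v state
    out = q.new_empty(T, d_v)
    for t in range(T):
        S = g[t] * S + torch.outer(k[t], v[t])  # gated state update
        out[t] = q[t] @ S                       # linear-time readout per step
    return out

# Example usage with random inputs:
# T, d_k, d_v = 128, 64, 64
# q, k = torch.randn(T, d_k), torch.randn(T, d_k)
# v, g = torch.randn(T, d_v), torch.rand(T)
# o = gated_linear_attention_reference(q, k, v, g)
```

The fused kernels process this recurrence in chunks and tiles rather than one step at a time; the sketch only shows the math being accelerated.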
We hope this is useful to the community!
Learn more:
📖 Blog: https://qwen.ai/blog?id=flashqla
💻 Code: https://github.com/QwenLM/FlashQLA
myglasstrip@reddit
I want to give my experience since there seemed to be a lot of doubters, but I'm no expert; I've just already experienced their work with their mnn engine. Most people use llama, but since I use my phone or tablet I use the mnn engine. The speeds I get on that are amazing, ESPECIALLY for prefill, and the speeds have improved GREATLY over the last 2 months. My speeds on mnn vs llama are a massive multiple difference, 4x I think on prefill. I'm not an expert; llama gave me the speeds I expected. I only tried this because I saw the Alibaba team post about it.
I'd just give them the benefit of the doubt and trust them, considering how good they've been for the community in the first place.
They're why I believe even mobile phones will heavily benefit from LLMs soon. Sure, most things will be in the cloud, but now that phones have 24 gigs of RAM at the top end you can run local models. Wait till we get to 32 gigs of RAM on device, then you can easily run something like a qwen moe for your intrusive thoughts or whatever.
pmttyji@reddit
Requirements
Far-Low-4705@reddit
noooo
me sitting in the corner with my AMD MI50's not getting support for 90% of new LLM inference tech
wektor420@reddit
Rtx 6000 pro bros share your pain
XForceForbidden@reddit
Cry and hold my 4090 48G
kapitanfind-us@reddit
What? Where did you get that :)
simracerman@reddit
China, maybe
zenmagnets@reddit
Turns out SM100 (Blackwell) and SM120 (Fake Consumer Blackwell) are both not above SM90
simracerman@reddit
Bruh.. last I checked 120 > 90. Unless Nvidia can’t do simple math
extopico@reddit
What the fk is CP? Just don’t abbreviate everything. Please.
saunderez@reddit
Context processing maybe? I know you were thinking cheese pizza though....
VoiceApprehensive893@reddit
q3.6 9b when
2muchnet42day@reddit
Nice
PaceZealousideal6091@reddit
But why just SM90? Is there a technical limitation preventing an SM89 implementation?
ikkiho@reddit
Real reason is that the speedup comes from a producer/consumer warp split that uses TMA (Tensor Memory Accelerator) plus WGMMA (warpgroup MMA) plus async barriers. SM90 (H100) was the first arch to ship all three, and SM100 (Blackwell) extends them.
Quick map:
- SM80/SM86 (Ampere) and SM89 (Ada): no TMA or WGMMA; cooperative loads with mma.sync.
- SM90 (Hopper, H100/H200): first arch to ship TMA + WGMMA + async barriers together.
- SM100 (Blackwell server): extends the SM90 feature set.
Could it be ported to SM89? Yes, but the speedup mostly evaporates. You'd be back to a FlashAttention-2 style cooperative load with mma.sync; register pressure goes up and pipeline depth has to come down. At that point Songlin Yang's FLA Triton kernels (the flash-linear-attention library, covering GDN, GLA, RWKV-7, Mamba2) are already a reasonable baseline on Ampere/Ada, within roughly 30% of theoretical max for those tiers. The 2–3× in the post is a headline against FLA on SM90, not a hardware-agnostic number.
Practical takeaway: this matters for Qwen3-Next style hybrid models (mostly linear-attention layers with a few full-attention layers interleaved), at >32k context, on H100/H200/B200. On a vanilla full-softmax Qwen3 it does nothing; on a 4090 it's not what you want.
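If you want to sanity-check your own card before going down this path, here is a quick check in plain PyTorch (nothing FlashQLA-specific, assumes a CUDA build):

```python
import torch

# Per the thread: the current kernels target SM90 (Hopper) only; SM100/SM120
# Blackwell aren't supported yet, so a plain ">= 9.0" check would be misleading.
major, minor = torch.cuda.get_device_capability()
if major == 9:
    print(f"SM{major}{minor}: Hopper detected, the TMA/WGMMA warp-specialized path applies.")
else:
    print(f"SM{major}{minor}: stick with Triton FLA / FlashAttention-2 style kernels for now.")
```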
PaceZealousideal6091@reddit
Thanks a lot for this. This is helpful.
Healthy-Nebula-3603@reddit
Yes
wektor420@reddit
Let's see if it works on sm120
Blackdragon1400@reddit
Not gonna lie I don’t understand 90% of the buzzwords on that webpage.
MaxKruse96@reddit
gguf wen
Cool-Chemical-5629@reddit
This is not a model, but rather something to enhance the architecture of inference engines. So if GGUF models are relevant to you, instead of asking "GGUF when" ask "Llama.cpp support when".
craftogrammer@reddit
Llama.cpp support when
themoregames@reddit
So basically you are saying, if we download this, we can run full Deep Seek v4 Pro on a Raspberry Pi 5 8 GB?
Cool-Chemical-5629@reddit
I'm afraid this won't reduce the memory demand, but it should give some extra speed boost for inference. In the OP there's a link to a blog post that explains in detail what it's about, but it gets very technical early on.
themoregames@reddit
hmm
How about cute kitten videos I can send to people on WhatsApp? Can it do that?
-illusoryMechanist@reddit
gguf when
Intelligent-Form6624@reddit
mom, is that you?
Cool-Chemical-5629@reddit
No, it's your daddy. Are you winning, son?
MaxKruse96@reddit
/s wen
RapidRaid@reddit
/q wen
Intelligent-Form6624@reddit
gguf nao?
qwen_next_gguf_when@reddit
SM90 or above.
Alternative_You3585@reddit
So it includes consumer Blackwell, right? Or only the server stuff like Hopper? I'm not that well versed in GPU instruction sets.
Healthy-Nebula-3603@reddit
Server only
zenmagnets@reddit
Nope. Doesn't even include Server SM100 Blackwell.
Maleficent-Ad5999@reddit
What does it mean? Me noob
rmhubbert@reddit
SM90 or above. Boo!
t3rmina1@reddit
Finally, something that works on my SM120 (maybe)
zenmagnets@reddit
Spoiler: It doesn't work on your SM120 fake consumer-tier blackwell. For that matter, it doesn't even work on SM100 server blackwell (yet)
CircularSeasoning@reddit
Buy 4x used 3090s they said. It'll be fun they said.
International-Try467@reddit
HANK DON'T ABBREVIATE CYBERPUNK
eidrag@reddit
credit points!!
PANIC_EXCEPTION@reddit
pokemon go combat power
Borkato@reddit
Optimized for WHAT???
LightBrightLeftRight@reddit
So, LOCAL for those of us with an H100 sitting around
No_Afternoon_4260@reddit
that or any blackwell
LightBrightLeftRight@reddit
If it needs Hopper, then consumer Blackwell cards like the 5090 won't necessarily support the primitives it uses. I don't think they've explicitly said whether it will be ported over, and if they're only testing on H200s we don't have enough information yet; backwards compatibility has gotten muddled this past generation.
Hopefully they clarify, and if the 50-series works, it will be a big leap for that series' usefulness!
ResearchCrafty1804@reddit (OP)
Forward and backward benchmark results across common configurations.
ResidentPositive4122@reddit
Whoever started this trend of one line in a color and the rest in gray (or white, or other shades of ONE color) deserves to be trolled by having every query they ever make to an LLM be secretly routed to a llama1 finetuned to talk like a parrot.
Constandinoskalifo@reddit
My supervisor told me that it's for colour-blind people.
LinkSea8324@reddit
wow dude
ProfessionalSpend589@reddit
I read a book 2 years ago (don't judge me) that recommended making your main data salient.
It helps people quickly orient on the important things in your graphs.
666666thats6sixes@reddit
Those are best practices from the age of analog photocopiers, where colors and even consistent shades of grey weren't really a thing and each copy degraded the image.
No idea why they're still used today, though.
Xamanthas@reddit
At least the patterns are different :)
No_Conversation9561@reddit
Does it improve the speed of existing Qwen3.6 models?
RandiyOrtonu@reddit
nice to see tilelang getting the recognition it needed
Hodler-mane@reddit
SM90+ only (H100s, Blackwell etc). It will only speed up pp by like 30% ish and nothing for tg.