Qwen Introduced FlashQLA
Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 59 comments
Introducing FlashQLA: high-performance linear attention kernels built on TileLang.
2–3× forward speedup. 2× backward speedup.
💻 Purpose-built for agentic AI on your personal devices.
Key insights:
- Gate-driven automatic intra-card CP.
- Hardware-friendly algebraic reformulation.
- TileLang fused warp-specialized kernels.
FlashQLA boosts SM utilization via automatic intra-device CP. The gains are especially pronounced for TP setups, small models, and long-context workloads.
Instead of fusing the entire GDN flow into a single kernel, we split it into two kernels optimized for CP and backward efficiency. At large batch sizes this incurs extra memory I/O overhead vs. a fully fused approach, but it delivers better real-world performance on edge devices and long-context workloads.
The backward pass was the hardest part: we built a 16-stage warp-specialized pipeline under extremely tight on-chip memory constraints, ultimately achieving 2×+ kernel-level speedups.
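For readers new to linear attention, here is a minimal plain-PyTorch sketch of the kind of gated recurrence these layers compute. It is a simplified illustration only, not the FlashQLA kernels or the full GDN update; the function name and shapes are made up for the example.

```python
import torch

def gated_linear_attention_reference(q, k, v, g):
    """Simplified sequential reference (illustration only, not FlashQLA/GDN).
    Recurrence:
        S_t = g_t * S_{t-1} + k_t^T v_t
        o_t = q_t S_t
    Shapes: q, k: [T, d_k]; v: [T, d_v]; g: [T] per-step decay gates in (0, 1).
    """
    T, d_k = q.shape
    d_v = v.shape[-1]
    S = q.new_zeros(d_k, d_v)                   # running d_k x d_v state
    out = q.new_empty(T, d_v)
    for t in range(T):
        S = g[t] * S + torch.outer(k[t], v[t])  # gated state update
        out[t] = q[t] @ S                       # linear-time readout per step
    return out

# Example usage with random inputs:
# T, d_k, d_v = 128, 64, 64
# q, k = torch.randn(T, d_k), torch.randn(T, d_k)
# v, g = torch.randn(T, d_v), torch.rand(T)
# o = gated_linear_attention_reference(q, k, v, g)
```

The fused kernels process this recurrence in chunks and tiles rather than one step at a time; the sketch only shows the math being accelerated.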
We hope this is useful to the community!
Learn more:
📖 Blog: https://qwen.ai/blog?id=flashqla
💻 Code: https://github.com/QwenLM/FlashQLA
myglasstrip@reddit
I want to give my experience since there seemed to be a lot of doubters, but I'm no expert; I've just already experienced their work with their mnn engine. Most people use llama, but since I use my phone or tablet I use the mnn engine. The speeds I get on that are amazing, ESPECIALLY for prefill, and the speeds have improved GREATLY over the last 2 months. My speeds on mnn vs llama are a massive multiple difference, 4x I think on prefill. I'm not an expert; llama gave me the speeds I expected. I only tried this because I saw the Alibaba team post about it.
I'd just give them the benefit of the doubt and trust them, considering how good they've been for the community in the first place.
They're why I believe even mobile phones will heavily benefit from LLMs soon. Sure, most things will be in the cloud, but now that phones have 24 gigs of RAM at the top end you can run local models. Wait till we get to 32 gigs of RAM on device, then you can easily run something like a qwen moe for your intrusive thoughts or whatever.
pmttyji@reddit
Requirements
Far-Low-4705@reddit
noooo
me sitting in the corner with my AMD MI50's not getting support for 90% of new LLM inference tech
wektor420@reddit
Rtx 6000 pro bros share your pain
XForceForbidden@reddit
Cry and hold my 4090 48G
kapitanfind-us@reddit
What? Where did you get that :)
simracerman@reddit
China, maybe
zenmagnets@reddit
Turns out SM100 (Blackwell) and SM120 (Fake Consumer Blackwell) are both not above SM90
simracerman@reddit
Bruh.. last I checked 120 > 90. Unless Nvidia can’t do simple math
extopico@reddit
What the fk is CP? Just don’t abbreviate everything. Please.
saunderez@reddit
Context processing maybe? I know you were thinking cheese pizza though....
VoiceApprehensive893@reddit
q3.6 9b when
2muchnet42day@reddit
Nice
PaceZealousideal6091@reddit
But why just SM90? Is there a technical limitation preventing an SM89 implementation?
ikkiho@reddit
Real reason is that the speedup comes from a producer/consumer warp split that uses TMA (Tensor Memory Accelerator) plus WGMMA (warpgroup MMA) plus async barriers. SM90 (H100) was the first arch to ship all three, and SM100 (Blackwell) extends them.
Quick map:
- SM80/SM86 (Ampere) and SM89 (Ada): no TMA or WGMMA; cooperative loads with mma.sync.
- SM90 (Hopper, H100/H200): first arch to ship TMA + WGMMA + async barriers together.
- SM100 (Blackwell server): extends the SM90 feature set.
Could it be ported to SM89? Yes, but the speedup mostly evaporates. You'd be back to a FlashAttention-2 style cooperative load with mma.sync; register pressure goes up and pipeline depth has to come down. At that point Songlin Yang's FLA Triton kernels (the flash-linear-attention library, covering GDN, GLA, RWKV-7, Mamba2) are already a reasonable baseline on Ampere/Ada, within roughly 30% of theoretical max for those tiers. The 2–3× in the post is a headline against FLA on SM90, not a hardware-agnostic number.
Practical takeaway: this matters for Qwen3-Next style hybrid models (mostly linear-attention layers with a few full-attention layers interleaved), at >32k context, on H100/H200/B200. On a vanilla full-softmax Qwen3 it does nothing; on a 4090 it's not what you want.
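If you want to sanity-check your own card before going down this path, here is a quick check in plain PyTorch (nothing FlashQLA-specific, assumes a CUDA build):

```python
import torch

# Per the thread: the current kernels target SM90 (Hopper) only; SM100/SM120
# Blackwell aren't supported yet, so a plain ">= 9.0" check would be misleading.
major, minor = torch.cuda.get_device_capability()
if major == 9:
    print(f"SM{major}{minor}: Hopper detected, the TMA/WGMMA warp-specialized path applies.")
else:
    print(f"SM{major}{minor}: stick with Triton FLA / FlashAttention-2 style kernels for now.")
```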
PaceZealousideal6091@reddit
Thanks a lot for this. This is helpful.
Healthy-Nebula-3603@reddit
Yes
wektor420@reddit
Let's see if it works on sm120
Blackdragon1400@reddit
Not gonna lie I don’t understand 90% of the buzzwords on that webpage.
MaxKruse96@reddit
gguf wen
Cool-Chemical-5629@reddit
This is not a model, but rather something to enhance the architecture of inference engines. So if GGUF models are relevant to you, instead of asking "GGUF when" ask "Llama.cpp support when".
craftogrammer@reddit
Llama.cpp support when
themoregames@reddit
So basically you are saying, if we download this, we can run full Deep Seek v4 Pro on a Raspberry Pi 5 8 GB?
Cool-Chemical-5629@reddit
I'm afraid this won't reduce the memory demand, but it should give some extra speed boost for inference. In the OP there's a link to a blog post that explains in detail what it's about, but it gets very technical early on.
themoregames@reddit
hmm
How about cute kitten videos I can send to people on WhatsApp? Can it do that?
-illusoryMechanist@reddit
gguf when
Intelligent-Form6624@reddit
mom, is that you?
Cool-Chemical-5629@reddit
No, it's your daddy. Are you winning, son?
MaxKruse96@reddit
/s wen
RapidRaid@reddit
/q wen
Intelligent-Form6624@reddit
gguf nao?
qwen_next_gguf_when@reddit
SM90 or above.
Alternative_You3585@reddit
So it includes consumer Blackwell, right? Or only the server stuff like Hopper? I'm not that well versed in GPU instruction sets.
Healthy-Nebula-3603@reddit
Server only
zenmagnets@reddit
Nope. Doesn't even include Server SM100 Blackwell.
Maleficent-Ad5999@reddit
What does it mean? Me noob
rmhubbert@reddit
SM90 or above. Boo!
t3rmina1@reddit
Finally, something that works on my SM120 (maybe)
zenmagnets@reddit
Spoiler: It doesn't work on your SM120 fake consumer-tier blackwell. For that matter, it doesn't even work on SM100 server blackwell (yet)
CircularSeasoning@reddit
Buy 4x used 3090s they said. It'll be fun they said.
International-Try467@reddit
HANK DON'T ABBREVIATE CYBERPUNK
eidrag@reddit
credit points!!
PANIC_EXCEPTION@reddit
pokemon go combat power
Borkato@reddit
Optimized for WHAT???
LightBrightLeftRight@reddit
So, LOCAL for those of us with an H100 sitting around
No_Afternoon_4260@reddit
that or any blackwell
LightBrightLeftRight@reddit
If it needs Hopper, then consumer Blackwell cards like the 5090 won't necessarily support the primitives it uses. I don't think they've explicitly said whether it will be ported over, and if they're only testing on H200s we don't have enough information yet; backwards compatibility has gotten muddled this past generation.
Hopefully they clarify, and if the 50-series works, it will be a big leap for that series' usefulness!
ResearchCrafty1804@reddit (OP)
Forward and backward benchmark results across common configurations.
ResidentPositive4122@reddit
Whoever started this trend of one line in a color and the rest in gray (or white, or other shades of ONE color) deserves to be trolled by having every query they ever make to an LLM be secretly routed to a llama1 finetuned to talk like a parrot.
Constandinoskalifo@reddit
My supervisor told me that it's for colour-blind people.
LinkSea8324@reddit
wow dude
ProfessionalSpend589@reddit
I read a book 2 years ago (don't judge me) that recommended making your main data salient.
It helps people quickly orient on the important things in your graphs.
666666thats6sixes@reddit
Those are best practices from the age of analog photocopiers, where colors and even consistent shades of grey weren't really a thing and each copy degraded the image.
No idea why they're still used today, though.
Xamanthas@reddit
At least the patterns are different :)
No_Conversation9561@reddit
Does it improve the speed of existing Qwen3.6 models?
RandiyOrtonu@reddit
nice to see tilelang getting the recognition it needed
Hodler-mane@reddit
SM90+ only (H100s, Blackwell etc). It will only speed up pp by like 30% ish and nothing for tg.