MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s

Posted by Sea-Speaker1700@reddit | LocalLLaMA

I've spent some time building a custom gfx12 MXFP4 kernel, since the included kernels either rely on Marlin or are GPT-OSS 120B only, and that model uses a non-standard MXFP4 implementation.

I've done TunableOp tuning for the R9700s and added the matrix configs. This repo already has the upgraded Transformers version needed for Qwen3.5 inference installed into it.
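For anyone re-tuning on different cards: assuming the tuning here means PyTorch's TunableOp (the env var names below are the documented PyTorch ones; the filename is illustrative), regenerating the GEMM configs looks roughly like:

```shell
# Documented PyTorch TunableOp env vars; the CSV filename is just an example.
export PYTORCH_TUNABLEOP_ENABLED=1                       # turn TunableOp on
export PYTORCH_TUNABLEOP_TUNING=1                        # allow new solutions to be tuned
export PYTORCH_TUNABLEOP_FILENAME=tunableop_results.csv  # where tuned configs land
# ...then run a representative inference workload once to populate the CSV.
```

After one tuning pass you can set `PYTORCH_TUNABLEOP_TUNING=0` so later runs just reuse the recorded solutions.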

Happy inferencing. Maybe someday the kernel will get merged upstream so we can all run MXFP4 on the default vLLM Docker images, but I won't be the one to do it. It works for me as is: within 5% of GPTQ INT4 performance, roughly half the decode speed of GPT-OSS 120B and 60% of its prefill speed.

It's locked to gfx12-series cards only because I don't have older cards to test on, but in theory the kernel is a universal dequant code path, which makes it a truly MXFP4-standards-compliant kernel that runs anywhere.
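For context, the dequant a universal MXFP4 path has to implement is simple: per the OCP Microscaling format, blocks of 32 FP4 (E2M1) values share one E8M0 power-of-two scale. A minimal NumPy sketch, not the repo's kernel code (the nibble packing order and function name are my assumptions):

```python
import numpy as np

# The 8 non-negative E2M1 magnitudes; the sign bit is handled separately.
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def dequant_mxfp4_block(packed: np.ndarray, scale_e8m0: int) -> np.ndarray:
    """Dequantize one 32-element MXFP4 block.

    packed: 16 uint8 bytes, two 4-bit codes per byte (low nibble first --
    an assumption; a real kernel may pack the other way around).
    scale_e8m0: the block's shared 8-bit exponent, biased by 127.
    """
    lo = packed & 0x0F
    hi = packed >> 4
    codes = np.empty(32, dtype=np.uint8)
    codes[0::2] = lo
    codes[1::2] = hi
    sign = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    mag = FP4_E2M1[codes & 0x7]
    # E8M0 has no sign and no mantissa: the scale is just 2^(e - 127).
    scale = np.float32(2.0) ** (int(scale_e8m0) - 127)
    return sign * mag * scale
```

On a GPU this becomes a per-lane table lookup plus one shared exponent shift per 32-element block, which is why the path can stay architecture-agnostic.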

https://hub.docker.com/repository/docker/tcclaviger/vllm-rocm-rdna4-mxfp4/general
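A sketch of running the image with the usual ROCm container flags; the image tag, model path, and serve arguments are assumptions on my part, so check the Docker Hub page for the actual invocation:

```shell
# Standard ROCm passthrough flags; tensor-parallel across the four R9700s.
# The model path placeholder is illustrative, not a real path.
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host \
  --shm-size 16g \
  -v ~/models:/models \
  -p 8000:8000 \
  tcclaviger/vllm-rocm-rdna4-mxfp4:latest \
  vllm serve /models/<mxfp4-model> --tensor-parallel-size 4
```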