MXFP4 kernel, RDNA 4, Qwen3.5 122B Quad R9700s
Posted by Sea-Speaker1700@reddit | LocalLLaMA | View on Reddit | 42 comments
I've spent some time building a custom gfx12 MXFP4 kernel, since the included kernels either rely on Marlin or are GPT OSS 120B only, and that model is a non-standard implementation.
I've done TunableOp tuning for the R9700s and added the matrix configs. This repo already has the upgraded Transformers version for Qwen3.5 inference installed into it.
Happy inferencing. Maybe someday the kernel will get merged upstream so we can all run MXFP4 on default vLLM docker images, but I won't be the one to do it. Works for me as is: within 5% of GPTQ INT4 performance, roughly half the decode speed of GPT OSS 120B and 60% of its prefill speed.
Locked to gfx12-series cards only because I don't have older cards to test on, but in theory the dequant code path is universal, making it a truly standards-compliant MXFP4 kernel that could run anywhere.
https://hub.docker.com/repository/docker/tcclaviger/vllm-rocm-rdna4-mxfp4/general
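For reference, MXFP4 stores weights as 4-bit e2m1 values in 32-element blocks, each block sharing one e8m0 (power-of-two) scale byte. A minimal NumPy sketch of the dequant math (layout and function names are my own for illustration, not the kernel's):

```python
import numpy as np

# Magnitudes representable by a 4-bit e2m1 element (sign bit handled separately)
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def dequant_mxfp4(codes, scales):
    """codes:  uint8, shape (n_blocks, 32), one 4-bit code per element
    scales: uint8, shape (n_blocks,), biased e8m0 exponent per block"""
    sign = np.where(codes & 0x8, -1.0, 1.0).astype(np.float32)
    mag = E2M1[codes & 0x7]
    block_scale = np.exp2(scales.astype(np.float32) - 127.0)  # 2**(e - 127)
    return sign * mag * block_scale[:, None]
```

The real kernel fuses this with the matmul, of course; this just shows why the format is simple enough to dequant on any hardware.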
djdeniro@reddit
Hey, it works amazingly! May I ask you to share the Dockerfile so I can build for 8x R9700? Something like this image, tcclaviger/vllm-rocm-rdna4-mxfp4; I just got 2 more R9700s.
Sea-Speaker1700@reddit (OP)
Should work, but I'll have a look. I've added some more things to improve it: there's a much, much faster prefix caching mechanism now in place for MTP align mode that lets you set the number of prefixes to cache using a CUDA-graph-stored buffer, about 40% prefix hit recall speed, and recompute on a miss is now partial.
Will be also squashing to make the docker image smaller :)
Dropping a new one soon.
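The shape of a bounded, configurable prefix cache can be sketched roughly like this (a hypothetical in-Python LRU sketch, not the actual CUDA-graph side-buffer implementation):

```python
from collections import OrderedDict

class PrefixCache:
    """Toy sketch: map a hash of the first prefix_len prompt tokens to a
    stored KV-buffer handle, keeping at most max_entries (LRU eviction)."""
    def __init__(self, max_entries=8, prefix_len=1024):
        self.max_entries = max_entries
        self.prefix_len = prefix_len
        self.entries = OrderedDict()

    def _key(self, tokens):
        return hash(tuple(tokens[:self.prefix_len]))

    def lookup(self, tokens):
        k = self._key(tokens)
        if k in self.entries:
            self.entries.move_to_end(k)   # mark as recently used
            return self.entries[k]        # hit: reuse cached KV buffer
        return None                       # miss: caller (partially) recomputes

    def store(self, tokens, kv_handle):
        k = self._key(tokens)
        self.entries[k] = kv_handle
        self.entries.move_to_end(k)
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict least-recently-used
```

In the real thing the "handle" would point into a preallocated GPU-side buffer; the point here is just the bounded-entries / hit-or-partial-recompute behavior described above.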
djdeniro@reddit
I see only two images here; can you invite me into your repo? I'll try to clone it and modify it for 8x GPU. Your repo only supports 4.
djdeniro@reddit
u/Sea-Speaker1700 I got the 397B model running successfully, but...
So concurrency at 32k context is limited, but it launches! I think it's because this model loads from mixed Q6 and Q4 quants; with a fully MXFP4 quant we would get a normal context size.
Sea-Speaker1700@reddit (OP)
As I understand it, the first number doesn't really apply to mixed-attention models that use GDN (the 8448 part).
On my setup I get a GPU KV cache size of 126,000 or something like that, and a max concurrency at 262,144 of 1.23. I have a test prompt that's 209,xxx tokens long and it works fine with no evictions, so the second value is definitely the one to watch.
I have a DCP implementation I'm working on that carries performance at long contexts much, much higher, +50% decode or so, but it hurts <50k context decode. Working on making DCP behavior dynamic so it's on when needed and otherwise off. It also increases KV space :P
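For anyone sanity-checking numbers like those: for plain full-attention layers, the reported concurrency is roughly cache tokens divided by max model length. A back-of-envelope sketch using the standard full-attention KV formula (which, as noted above, does not account for hybrid GDN layers):

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # One K and one V vector per attention layer, per token
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def kv_cache_tokens(free_vram_bytes, num_layers, num_kv_heads, head_dim,
                    dtype_bytes=2):
    # How many tokens of KV cache fit in the VRAM left after weights
    return free_vram_bytes // kv_bytes_per_token(
        num_layers, num_kv_heads, head_dim, dtype_bytes)

def max_concurrency(cache_tokens, max_model_len):
    # vLLM-style "maximum concurrency for N tokens per request" estimate
    return cache_tokens / max_model_len
```

For example, roughly 322k cache tokens at a 262,144 max model length gives a concurrency of about 1.23; the illustrative parameter values here are mine, not the model's actual config.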
djdeniro@reddit
I quantized it myself from FP8 to MXFP4, and it works well now. Getting 33-34 t/s without using MTP.
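For others wondering about a DIY quant: the core per-block step is picking a power-of-two scale so the block max lands at or below 6 (the largest e2m1 magnitude), then rounding each element to the nearest e2m1 level. A rough NumPy sketch of that step (my own illustration, not AMD Quark's or any specific tool's recipe):

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quant_mxfp4_block(w):
    """Quantize one 32-element float block to (4-bit codes, e8m0 scale byte)."""
    amax = float(np.abs(w).max())
    # Smallest power-of-two scale such that amax / scale <= 6
    exp = int(np.ceil(np.log2(amax / 6.0))) if amax > 0 else 0
    scale = 2.0 ** exp
    mag = np.abs(w) / scale
    idx = np.abs(mag[:, None] - E2M1[None, :]).argmin(axis=1)  # nearest level
    codes = idx.astype(np.uint8) | np.where(w < 0, 0x8, 0x0).astype(np.uint8)
    return codes, np.uint8(exp + 127)  # biased power-of-two exponent
```

A full recipe would run this over every 32-element block of each weight tensor and pack two codes per byte; real tools also handle outliers and layer exclusions more carefully.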
Sea-Speaker1700@reddit (OP)
This should work for TP8; it has a new side buffer that makes prefix caching actually work correctly in align mode with GDN. See the repo README for the new variable that defines the side buffer entry size:
https://hub.docker.com/repository/docker/tcclaviger/vllm-rocm-mxfp4-nvfp4/general
djdeniro@reddit
With 397B I always get OOM.
With the 122B model on 8x R9700 I get up to 140 t/s on some prompts with MTP 2.
The 397B GPTQ version outputs only exclamation marks with --dtype float16.
Sea-Speaker1700@reddit (OP)
More MTP will raise it, even 4 works well on 122.
I'll see if I can find the bug for 397, but I can't test it so I can't validate that it's not the quant being damaged.
Which 397 quant did you use?
djdeniro@reddit
I tried to load this one; with MTP it doesn't work, because it's AMD Quark, I think.
https://huggingface.co/amd/Qwen3.5-397B-A17B-MXFP4
djdeniro@reddit
I hope this is also valuable info:
Sea-Speaker1700@reddit (OP)
At a glance this looks like an incompatible quantization.
djdeniro@reddit
Hey! It was my mistake; I didn't disable AITER and some arguments from your Docker setup guide on Docker Hub.
Now the GPUs are not overloaded all the time (they no longer show 100% load).
Will test with 397B soon!
djdeniro@reddit
By the way, maybe you have a recipe for making your own MXFP4 quantization?
djdeniro@reddit
Yes, I think the same; I tried to test their quantization before and it never works without --enforce-eager.
djdeniro@reddit
Parameters to launch (later I replaced the 155k max-model-len with 64k):
djdeniro@reddit
I ran with -tp 8 and it works! Then I tried to launch the 397B model, but no success; maybe the MXFP4 quantization from AMD doesn't work out of the box.
djdeniro@reddit
My question was from before testing (I will test it soon, but I found the 0,1,2,3 visible devices setting on Docker Hub); that's why I asked.
djdeniro@reddit
Thank you!!! I just pulled it, will connect all 8 GPUs tonight!
sloptimizer@reddit
Does MTP work with this setup? That would make it much better than existing kernels.
Sea-Speaker1700@reddit (OP)
Cause of MTP failure identified, working to resolve it now :P
Capital_Evening1082@reddit
The memory error is fixed by upgrading to ROCm 7.2 and compiling vllm against that.
The ROCm 7.0 based docker images all suffer from the memory access / page fault errors.
Sea-Speaker1700@reddit (OP)
Good to hear, it's the last thing I was fighting and unable to isolate.
Now have MTP 4 working at ~80% acceptance for 122B.
Sub-30k context is ~135 tps decode, but it falls off a cliff after that. There's an interaction between GDN/Triton or some other unidentified area that, judging from the speed, makes it scale from O(n) to some higher power; the falloff after the inflection is much steeper than O(n²).
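A quick way to check that kind of scaling suspicion is to fit an exponent through two timings at different context lengths, assuming t ≈ c·n^x:

```python
import math

def scaling_exponent(n1, t1, n2, t2):
    """Fit t = c * n**x through two (context_length, time) measurements
    and return x: 1.0 means linear, 2.0 quadratic, higher is worse."""
    return math.log(t2 / t1) / math.log(n2 / n1)
```

For example, if per-token decode time doubles going from 30k to 60k context, the exponent is 1.0 (linear); if it quadruples, it's 2.0.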
Capital_Evening1082@reddit
I have MTP working, too, thanks to your comment above :)
Acceptance is ~80% for 27B and I'm getting ~50 t/s generation for a single request on 2x R9700, so the dense model is now usable.
I'll try 122B on the 4x R9700 setup on Monday.
What are you using for benchmarking?
Sea-Speaker1700@reddit (OP)
I have a custom benchmark I wrote that sends book sections, code, etc., a bunch of stuff. Glad it's up for you!
Going to push a new image today that has:
- ROCm 7.2.1
- My updated MXFP4 kernel using Triton matmul_ogs (plays nicer with vLLM)
- vLLM 18.2
- AMD PyTorch and all dependencies explicitly for gfx1201
- Graceful MXFP4-on-RDNA4 detection, so it shouldn't interfere with any other models
- My custom all_gather buffer (massive perf uplift at TP4 high concurrency)
122B: ~130 TPS below 50k context, gracefully degrading to ~70 at 110k.
I compiled it from scratch, and it no longer falls apart at high context. It's big right now, but I'll slim it soon; about 19 GB can be stripped out of it.
I'm super close to having CK attention working; it runs, but some garbage (and some correct output) comes out, so I'm going to release before that's done, as it may take a bit. CK attention basically solves the AMD falloff at high context completely. It would allow for massive-context multi-parallel agents.
Capital_Evening1082@reddit
OMG that is amazing!
For me, at the moment, the biggest issue is TunableOps causing reboots (on both EPYC and Xeon platforms).
Without them, MXFP4 performance is only 50% of GPTQ-INT4.
Or is the performance difference coming from something else, and TunableOps aren't actually needed?
Sea-Speaker1700@reddit (OP)
There is an update, check the side-buffer tag, works great
sloptimizer@reddit
Yes, please... I really want to see some MTP with a usable model running on AMD cards!!!
Sea-Speaker1700@reddit (OP)
Initial results:
Sea-Speaker1700@reddit (OP)
It will later today :P I have it running now at over 80 tps decode using MTP.
sloptimizer@reddit
Could you share a link to your git repo?
Sea-Speaker1700@reddit (OP)
Haven't attempted yet; that will be the next hurdle after NVFP4 RDNA emulation kernel validation.
Sea-Speaker1700@reddit (OP)
Image updated: ROCm 7.2.1 and vLLM 18, no NVFP4 support in it yet.
Much better performance.
Will work on getting a non-dev-env sized image built soon.
Sea-Speaker1700@reddit (OP)
I've now migrated from the external custom kernel to Triton matmul-based kernels, like GPT OSS, but using silu_and_mul instead of swiglu for the intermediate. Haven't pushed the updated image yet; will do after the migrations below.
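For context, silu_and_mul is the vLLM-style gated intermediate: the fused projection output is split in half on the last axis, with the first half SiLU-gated and multiplied into the second. A NumPy sketch of the math (the real kernel fuses this, of course):

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def silu_and_mul(x):
    # x holds [gate | up] concatenated on the last axis
    gate, up = np.split(x, 2, axis=-1)
    return silu(gate) * up
```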
MTP is resolved and now scaling wildly well with each added predictive token; 135 TPS for an A10B is great at MTP 4.
Without MTP it sits solidly in the mid 50s, tapering to the high 30s at 100k KV.
The RDNA4 GEMM sweep completed using the MXFP4 design and I've identified the ideal MNKG values, so those are integrated into the kernel directly.
Built CK attention and will be inserting it tonight, then migrating it all to a ROCm 7.2-based container against the openai-rocm branch to keep it updated. No idea if CK attention will play nicely with GDN or the ViT yet; we'll see.
Original goal met and surpassed: faster than the upstream GPTQ unified kernel. Just need to extend throughput stability beyond 50k context now.
Side note: I did play with a distilled GDN MTP instead of a full-attention MTP. Quite successful for MTP 1 or 2, but at 3 and beyond prediction quality falls off a cliff (needs more training, for sure). It may actually be useful in the 1-million-context use case to sustain throughput speeds toward the big end of KV usage.
grunt_monkey_@reddit
Congrats sir, and thanks for sharing your repo. May I ask what's the benefit of mxfp4 versus int4-gptq for qwen3.5-122b?
Sea-Speaker1700@reddit (OP)
Weight accuracy. Scaled FP is more accurate in weight reconstruction.
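A rough illustration of the difference (a toy NumPy comparison on one random block; real quantizers are more sophisticated): INT4 spaces its 15 levels uniformly across the block range, while e2m1's levels are denser near zero, where most Gaussian-ish weights live.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 1, 32).astype(np.float32)        # one weight block

# INT4 symmetric: uniform levels, scale = block max / 7
s_int = np.abs(w).max() / 7.0
w_int4 = np.clip(np.round(w / s_int), -7, 7) * s_int

# MXFP4-style: non-uniform e2m1 levels with a power-of-two block scale
levels = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
s_fp = 2.0 ** np.ceil(np.log2(np.abs(w).max() / 6.0))
idx = np.abs(np.abs(w)[:, None] / s_fp - levels[None, :]).argmin(axis=1)
w_fp4 = np.sign(w) * levels[idx] * s_fp

err_int4 = float(np.abs(w - w_int4).mean())  # mean reconstruction error
err_fp4 = float(np.abs(w - w_fp4).mean())
```

Results vary by block and distribution, so this is a demo of the mechanism rather than a proof of the accuracy claim.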
chadsly@reddit
This is the kind of post I wish there were more of, actual implementation work instead of benchmark screenshots with no detail. Getting custom kernels working on odd hardware combos is where a lot of the real progress happens. How much gain are you seeing versus the stock path right now?
Sea-Speaker1700@reddit (OP)
There is no stock path; MXFP4 is completely non-functional on RDNA unless it's GPT OSS.
sn2006gy@reddit
I was hopeful AMD would announce native MXFP4 support so we don't end up with the Nvidia tax on knowledge in our future. They're part of the open body, but I've yet to hear of any product support, and NVFP4's success should have lit a fire under AMD to think about it.
Sea-Speaker1700@reddit (OP)
Yeah, exactly. I got sick of waiting. I have an NVFP4 dequant as well, and now that FP8 is working on RDNA from other commits, I can dequant it to FP8.
Will be the next kernel I integrate.
sn2006gy@reddit
This is awesome!
Sea-Speaker1700@reddit (OP)
I'll be working on a bit of kernel optimization: turning it into a fused kernel that detects and bypasses zero-weight rows automatically for sparse models (like Qwen 3.5 122B) that have largely zero-valued up_proj weights.
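The zero-weight bypass idea can be sketched like this (hypothetical names; the real version would build the mask at load time and do the skip inside the fused kernel):

```python
import numpy as np

def build_skip_mask(weight, threshold=0.0):
    # True where every weight in the output row is (near) zero
    return np.abs(weight).max(axis=1) <= threshold

def sparse_matvec(weight, x, skip):
    # Compute dot products only for live rows; skipped rows output zero
    out = np.zeros(weight.shape[0], dtype=x.dtype)
    active = ~skip
    out[active] = weight[active] @ x
    return out
```

The payoff is that all-zero rows cost nothing at inference, while the result is identical to the dense matvec.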