A TurboQuant-ready llama.cpp with gfx906 optimizations, for gfx906 users.
Posted by Exact-Cupcake-2603@reddit | LocalLLaMA | View on Reddit | 7 comments
So this is my take on the TurboQuant trend. It's another llama.cpp fork, it's vibe-coded, but it works like a charm for me, so it may interest some. I'm currently adding Gemma4 architecture support; it will come soon. I'm not really aware of the benchmark standards in this community, so feel free to suggest some.
stunning_man_007@reddit
Nice work on the fork! I've been curious about TurboQuant but hadn't seen a gfx906-optimized build yet. Would be cool to see how it compares on some of the standard LLM benchmarks once you get those Gemma4 weights to test with.
Ok_Fish_39@reddit
I won't comment on turbo, but in normal testing your fork was 10% faster than the current best gfx906 solution, the docker.io/mixa3607/llama.cpp-gfx906:full-b8639-rocm-7.2.0 image. Hopefully your performance tuning will reach all gfx906 (AMD MI50/MI60/Radeon VII) llama.cpp forks.
Exact-Cupcake-2603@reddit (OP)
Glad to read that! Turbo degrades performance, so overall the speedup compensates for the loss. It's very helpful with a tight VRAM fit and can sometimes allow loading better quants of a model.
Ok_Fish_39@reddit
If you're bored, have your agent add llama.cpp and gfx906 support to this project: https://github.com/AMD-AGI/Apex. 10,000 rounds of this can give you amazing results. I ran out of tokens after 100 rounds :)
juss-i@reddit
Running llama-bench on your branch vs. standard llama.cpp with ROCm is a good start.
Exact-Cupcake-2603@reddit (OP)
Ok, thank you, I will update soon with numbers.
No-Refrigerator-1672@reddit
Don't run llama-bench with just the default params; set it to test multiple prompt lengths. llama.cpp has a steep performance falloff at long contexts, but by default llama-bench only tests a short sequence, which paints an overly optimistic picture.
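A minimal sketch of such an invocation (the model path is a placeholder; `-p` takes a comma-separated list of prompt lengths, `-n` the number of generated tokens):

```shell
# Benchmark prompt processing at several context lengths plus token generation,
# with all layers offloaded to the GPU (-ngl 99).
# ./models/your-model.gguf is a placeholder path.
./llama-bench -m ./models/your-model.gguf \
  -p 512,2048,8192 \
  -n 128 \
  -ngl 99
```

Comparing the tok/s figures at 512 vs. 8192 prompt tokens should make any long-context falloff visible for both the fork and upstream llama.cpp.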