A TurboQuant-ready llama.cpp with gfx906 optimizations, for gfx906 users.
Posted by Exact-Cupcake-2603@reddit | LocalLLaMA | View on Reddit | 7 comments
So this is my take on the TurboQuant trend. It's another llama.cpp fork, it's vibe-coded, but it works like a charm for me, so it may interest some. I'm currently adding Gemma4 architecture support; it will come soon. I'm not really aware of the benchmark standards in this community, so feel free to suggest some.
stunning_man_007@reddit
Nice work on the fork! I've been curious about TurboQuant but hadn't seen a gfx906-optimized build yet. Would be cool to see how it compares on some of the standard LLM benchmarks once you get those Gemma4 weights to test with.
Ok_Fish_39@reddit
I won't comment on turbo, but in normal testing your fork was 10% faster than the current best gfx906 solution, the docker.io/mixa3607/llama.cpp-gfx906:full-b8639-rocm-7.2.0 image. Hopefully your performance tuning will reach all gfx906 (AMD MI50/MI60/Radeon VII) llama.cpp forks.
Exact-Cupcake-2603@reddit (OP)
Glad to read that! Turbo degrades performance, so overall the speedup compensates for the loss. It's very helpful with a tight VRAM fit and can sometimes allow loading better quants of a model.
Ok_Fish_39@reddit
If you're bored, have your agent add llama.cpp and gfx906 support to this project: https://github.com/AMD-AGI/Apex. 10,000 rounds of this can give you amazing results. I ran out of tokens after 100 rounds :)
juss-i@reddit
Running llama-bench on your branch vs. standard llama.cpp with ROCm is a good start.
Exact-Cupcake-2603@reddit (OP)
Ok, thank you, I will update soon with numbers.
No-Refrigerator-1672@reddit
Don't run llama-bench with just the default params; set it to test multiple prompt lengths. llama.cpp has a steep performance falloff at long contexts, but by default llama-bench only tests a short sequence, which paints an overly optimistic picture.
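A minimal sketch of such an invocation (the model path is a placeholder; `-p` takes a comma-separated list of prompt lengths, `-n` the number of generated tokens):

```shell
# Benchmark prompt processing at several context lengths plus token generation,
# with all layers offloaded to the GPU (-ngl 99).
# ./models/your-model.gguf is a placeholder path.
./llama-bench -m ./models/your-model.gguf \
  -p 512,2048,8192 \
  -n 128 \
  -ngl 99
```

Comparing the tok/s figures at 512 vs. 8192 prompt tokens should make any long-context falloff visible for both the fork and upstream llama.cpp.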