For those with a 6700 XT GPU (gfx1031) - ROCm - Open WebUI
Posted by uber-linny@reddit | LocalLLaMA | View on Reddit | 23 comments
Just thought I'd share my setup for those starting out or looking to improve theirs, as I think it's about as good as it's going to get. For context, I have a 6700 XT with a 5600X and 16GB of system RAM. If there are any better/faster ways, I'm open to suggestions.
Between all the threads of information and the little goldmines along the way, I need to share some links, and let you know that Google AI Studio was my friend in getting a lot of this built for my system.
- I have ROCm 7.1.1 built: https://github.com/guinmoon/rocm7_builds with gfx1031 rocBLAS libraries from https://github.com/likelovewant/ROCmLibs-for-gfx1103-AMD780M-APU
- I build my own llama.cpp targeting gfx1031 (6700 XT) against ROCm 7.1.1
- I use llama-swap for my models: https://github.com/mostlygeek/llama-swap as you can still use vision models by defining the mmproj file.
- I run Open WebUI in Docker: https://github.com/open-webui/open-webui
- I install kokoro-onnx (fast Kokoro TTS) from GitHub: https://github.com/thewh1teagle/kokoro-onnx (`pip install --force-reinstall "git+https://github.com/thewh1teagle/kokoro-onnx.git"`)
- I build whisper.cpp with Vulkan and VAD: https://github.com/ggml-org/whisper.cpp/tree/master?tab=readme-ov-file#vulkan-gpu-support and modify server.cpp to change the "/inference" endpoint to "/v1/audio/transcriptions"
- I run Docling via Python: `pip install "docling-serve[ui]"` (to upgrade: `pip install --upgrade "docling-serve[ui]"`)
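To illustrate the llama-swap piece, here's a minimal config sketch. The model names, file paths, and context size below are assumptions for illustration, not my actual files; the point is that a vision model just gets an extra `--mmproj` line in its `cmd`:

```yaml
models:
  # Text-only model
  "ministral-14b":
    cmd: >
      llama-server --port ${PORT}
      -m C:\models\ministral3-14b-instruct-Q5_K_XL.gguf
      -ngl 99 -c 16384
  # Vision model: same pattern, plus the mmproj projector file
  "qwen-vl":
    cmd: >
      llama-server --port ${PORT}
      -m C:\models\qwen2.5-vl-7b-Q5_K_M.gguf
      --mmproj C:\models\mmproj-qwen2.5-vl-7b-f16.gguf
      -ngl 99
```

llama-swap substitutes `${PORT}` itself and swaps models in and out on demand, so Open WebUI only ever talks to the one proxy endpoint.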
I had to install Python 3.12.x to get ROCm built. Yes, I know my ROCm is butchered, and I don't really know what I'm doing, but it's working: it looks like 7.1.1 is being used for text generation, while the imagery rocBLAS is using the 6.4.2 /bin/library.
My system is set up with a *.bat file that starts each service on boot in its own CMD window, running in the background ready to be called by Open WebUI. I've tried to use Python along the way, as Docker seems to take up a lot of resources, and I tend to get between 22-25 t/s on ministral3-14b-instruct Q5_XL with a 16k context.
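A launcher like that can be sketched roughly as below. All paths, binary names, and ports here are placeholders, not my actual layout; `start /MIN` is just the standard way to spawn each service minimized in its own window:

```bat
@echo off
REM Hypothetical paths - start each backend minimized in its own CMD window
start "llama-swap" /MIN C:\ai\llama-swap.exe --config C:\ai\llama-swap.yaml
start "whisper"    /MIN C:\ai\whisper-server.exe -m C:\ai\models\ggml-large-v3-turbo.bin --port 8081
start "kokoro"     /MIN cmd /k python C:\ai\kokoro_server.py
start "docling"    /MIN cmd /k docling-serve run
```

Drop a shortcut to the .bat into the Startup folder (`shell:startup`) and everything is listening before Open WebUI comes up.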
I also got stable-diffusion.cpp working with Z-Image last night using the same custom build approach.
If you're having trouble, DM me, or I might add it all to a GitHub repo later so it can be shared.
PssyGotWifi@reddit
Hey bud, I have a 6700 XT that's been sitting around collecting dust for over a year now (since upgrading to a 9070 XT). Is this worth setting up, or would I be better off just trying to sell the card?
uber-linny@reddit (OP)
Sell it... I sold mine and went to a 9070 XT too.
The amount of issues I kept having with the custom ROCm builds... sure, it would work, then it would randomly stop working and you'd have to rebuild from scratch. It was like a fortnightly thing. Just get another RDNA 4 card for more VRAM, but it's probably cheaper to just use an API.
alecz20@reddit
Thanks a lot for this post and for the builds.
I downloaded and installed the builds, but pytorch only detects CPU:
```
torch.cuda.is_available()
False
```
Did I mess up the installation? How did you validate that PyTorch can now use the AMD GPU?
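(For reference: a ROCm build of PyTorch exposes `torch.version.hip` as a version string and reuses the `torch.cuda` API for AMD GPUs, while a CPU-only or CUDA build reports `None` there. A minimal check, assuming PyTorch itself is installed:)

```python
import torch

# ROCm builds of PyTorch report a HIP version string here;
# CPU-only and CUDA builds report None.
print("HIP runtime:", torch.version.hip)
print("GPU available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # On ROCm, the torch.cuda device API maps to the AMD GPU.
    print("Device:", torch.cuda.get_device_name(0))
```

If `torch.version.hip` is `None`, the wheel you installed was never a ROCm build in the first place, regardless of drivers.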
MelodicFuntasy@reddit
Can you explain this part? What's different or better about it than installing the normal ROCm build?
uber-linny@reddit (OP)
hey,
A1. When installing ROCm 7.1.1, text generation worked perfectly, but when I started using vision models with --mmproj, rocBLAS was failing. Adding the 6.4.2 library in the parent directory of llama.cpp seemed to fix that.
The approach was similar to how you pull the ROCm release from llama.cpp, and it lines up with A2.
A2. I just did it to control the versioning and make sure llama.cpp is using what I think it should be using, as I'm not confident whether it's on 7.1.1 or 6.4.2. But Old_Box mentioned there's not much performance to be gained anyway, and I'm pretty sure the 7.1.1 gains are mostly in prompt processing.
MelodicFuntasy@reddit
Thanks for the explanation! I have the same card, so I was wondering how my setup is different from yours and if it could be improved. The only difference is that I'm on GNU/Linux. I remember that I've had some issues with some models in llama.cpp, maybe that was related, but I can't be sure. I will keep that in mind if I see any problems in the future!
By the way, do you know anything about using FlashAttention with this GPU? I don't do much stuff with LLMs, I mostly use it in ComfyUI for image generation and such. I tried to use FlashAttention Triton (since I think that's the only version that works on RDNA cards) there with PyTorch (which comes with its own version of ROCm) and it always seemed to only slow things down, instead of speeding things up. Maybe this card is just too old to benefit from it, but I was wondering if you know anything about this. I also tried to use FlashAttention in llama.cpp with the commandline parameter, but I can't remember if it was faster.
uber-linny@reddit (OP)
I only use it in llama.cpp, and mainly to quantize the KV cache to q8 so I can free up some VRAM for context. When I go below q8, I noticed some models didn't like it; it slowed down or didn't work.
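In llama.cpp terms, that combination corresponds to flags along these lines (the model path and context size are placeholders, and the exact flag spellings vary a bit between llama.cpp versions; note that quantized KV cache requires flash attention to be enabled):

```shell
# Flash attention on, KV cache quantized to q8_0 to free VRAM for context
llama-server -m ministral3-14b-instruct-Q5_K_XL.gguf \
  --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  -c 16384 -ngl 99
```

Dropping the cache types to q4_0 saves more VRAM but, as above, some models degrade noticeably at that point.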
Old_Box_5438@reddit
Tensile templates for rocBLAS kernels on RDNA2 haven't changed at all in about 3 years, so you shouldn't be leaving much performance on the table by reusing the 6.4.2 kernels in 7.1.1 for your card (I just tried the 6.4.2 kernels from that GH repo with ROCm 7.1.1 on a 680M and got practically the same pp/tg in llama.cpp as with the 7.1.1 kernels). You can also make rocBLAS use the 6900 XT kernels from ROCm 7.1.1 with the environment variable HSA_OVERRIDE_GFX_VERSION="10.3.0" to check if there's any difference; they're supposed to be interchangeable across all RDNA2.
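Trying the override is a one-liner (assuming a Linux shell; the binary name and model path are placeholders):

```shell
# Report gfx1030 (6900 XT) to ROCm so rocBLAS picks the Navi 21 kernels
HSA_OVERRIDE_GFX_VERSION=10.3.0 ./llama-server -m model.gguf -ngl 99
```

Benchmark pp/tg with and without it and the numbers should come out practically identical.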
MelodicFuntasy@reddit
RX 6700 XT has never worked for me without that environment variable, I think it's required. I don't think ROCm supports gfx1031.
Old_Box_5438@reddit
You need that env variable if your copy of rocBLAS doesn't have kernels compiled for your GPU. OP replaced his 7.1.1 kernels with ones compiled for the 6700 XT against 6.4.2, so it works without the GPU version override. To compile kernels for the 6700 XT, you can reuse the Tensile templates for the 6900 XT by changing the ISA number and gfx version and adding your GPU number in a few places in the Tensile/rocBLAS source code. You can find more details in this PR: https://github.com/ROCm/rocm-libraries/pull/1943
MelodicFuntasy@reddit
Thank you for explaining! But as you said, this doesn't give any significant performance benefit in AI inference?
Old_Box_5438@reddit
My understanding is that there shouldn't be much performance benefit, since all RDNA2 cards reuse the same three-year-old Navi 21 (6900 XT) Tensile templates; you just avoid having to use the env variable override.
MelodicFuntasy@reddit
I see. Since you are knowledgeable on this stuff, do you know anything about using FlashAttention in PyTorch? Whenever I tried to use FlashAttention Triton on my RX 6700 XT in ComfyUI, it would end up being slower, instead of speeding things up.
Old_Box_5438@reddit
No idea tbh, I only learned all this random stuff about rdna2 kernels cause I spent too much time trying to compile rocm and llama.cpp for my 680m lol. It may have something to do with rdna2 lacking wmma instructions, but I never really looked much into it
MelodicFuntasy@reddit
No worries :D. I've heard of people using SageAttention on the Nvidia RTX 3000 series, which is the closest competitor from Nvidia, and it kinda makes me jealous that we don't get similar speedups on RDNA 2 :). The lack of WMMA instructions is the explanation I've seen mentioned by other people, so maybe it's true.
wesmo1@reddit
I would be interested to take a squiz if you set up a GitHub
uber-linny@reddit (OP)
o0LINNY0o/Local-AI-Stack_RX-6700-XT-ROCm-7.x.: Repository of my 6700XT GFX1031 (ROCm 7.1.1) Configuration files
wesmo1@reddit
This is really interesting, have you tried using ik_llama with your system to offload the experts on a larger moe model?
uber-linny@reddit (OP)
I thought about it, but might go down that rabbit hole later. Because I only have 16GB of RAM and 12GB of VRAM, I still think I'll have difficulty fitting a decent model.
wesmo1@reddit
Try the q2_k quants of qwen 3 vl 30b a3b and work your way up in quant size. Just make sure you are testing with realistic context size for your use cases. You can also play around with the newer Nvidia nemotron cascade 14b dense model
uber-linny@reddit (OP)
After building with Vulkan, it also looks like my system is just too small for a 20B model.
uber-linny@reddit (OP)
Doesn't look like I can build it... it's getting stuck. Unless it starts working, I'm going to give up on this idea.