Is anyone able to successfully run Qwen 30B Coder BF16?
Posted by TokenRingAI@reddit | LocalLLaMA | View on Reddit | 20 comments
With llama.cpp and the Unsloth GGUFs for Qwen 30B Coder BF16, I am getting frequent crashes on two entirely different systems: a Ryzen AI Max and an RTX 6000 Blackwell.
Llama.cpp just exits with no error message after a few messages.
vLLM works perfectly on the Blackwell with the official model from Qwen, except tool calling is currently broken, even with the new Qwen3 tool call parser that vLLM added, so the tool call instructions just end up in the chat stream.
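A sketch of that kind of vLLM launch, for reference (the model id, context length, and port here are illustrative rather than my exact invocation):

    vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
        --enable-auto-tool-choice \
        --tool-call-parser qwen3_coder \
        --max-model-len 65536 \
        --port 8000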
enonrick@reddit
No problems with my setup, an RTX 8000 + RTX A6000, with llama.cpp (8ff2060).
TokenRingAI@reddit (OP)
Here's my docker compose file:
After the latest update to the server-cuda container, which came out at midnight, I am now getting this error on the Blackwell, whereas before it just exited with no error message or dmesg trap:
enonrick@reddit
Can't help much since I don't use Docker. It looks like either llama.cpp or the CUDA image has a compatibility problem or an ABI conflict with the kernel. Try building a fresh llama.cpp.
TokenRingAI@reddit (OP)
Set the context length shorter, and now I'm getting this on the Blackwell:
Secure_Reflection409@reddit
Is 30b-coder actually of 2507 ilk? It feels worse.
DistanceSolar1449@reddit
-ts 1,1 is lol
At 256k tokens max context you need only 70GB. You’re better off with -ts 1,2 or -ts 2,1 to fill the A6000
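Roughly like this (model path and context size are placeholders, not a tested command):

    llama-server -m Qwen3-Coder-30B-A3B-Instruct-BF16.gguf -c 262144 -ngl 99 -ts 2,1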
TokenRingAI@reddit (OP)
Is that the unsloth GGUF?
enonrick@reddit
Yes, from unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
Secure_Reflection409@reddit
How are you doing tools? Is it via Roo? A chap posted a Roo-specific fix which finally allowed 30B Coder to work consistently for me.
TokenRingAI@reddit (OP)
No, through the OpenAI-compatible tool API.
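I.e. plain chat completion requests with a tools array, along these lines (the endpoint, model id, and the read_file tool here are just illustrative):

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
        "messages": [{"role": "user", "content": "Read the file src/main.py"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "read_file",
            "description": "Read a file from the workspace",
            "parameters": {
              "type": "object",
              "properties": {"path": {"type": "string"}},
              "required": ["path"]
            }
          }
        }]
      }'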
DeltaSqueezer@reddit
What parameters are you using to start vLLM? Tool calling works fine for me.
TokenRingAI@reddit (OP)
DeltaSqueezer@reddit
you might want to try the hermes tool call parser instead of qwen3_coder
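Something like this, assuming the rest of your serve command stays the same:

    vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --enable-auto-tool-choice --tool-call-parser hermes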
TokenRingAI@reddit (OP)
It just dumps the tool calls into the chat stream with either tool call parser
TokenRingAI@reddit (OP)
Here is the bug:
https://github.com/vllm-project/vllm/issues/22975
complead@reddit
It might help to check if the crashes are related to memory limits on your systems. Llama.cpp can be memory-heavy, so try lowering the context size. Also, ensure you're using the latest version of llama.cpp as there might be bug fixes or optimizations that address these issues. Another angle is testing with different configuration flags to see if specific settings are causing the issue.
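For example, a stripped-down launch with a smaller context window to rule out memory pressure (model path and values are placeholders):

    llama-server -m Qwen3-Coder-30B-A3B-Instruct-BF16.gguf -c 16384 -ngl 99 --port 8000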
Marksta@reddit
Found a Qwen 30B right here in the comments!
NNN_Throwaway2@reddit
I run it through LMStudio on a 7900XTX and 7900X with 96GB RAM. I have not used the tool-calling capabilities, however.
RagingAnemone@reddit
I am on my Mac:

    llama-server --jinja -m models/Qwen3-Coder-30B-A3B-Instruct-1M-BF16.gguf -c 32768 -ngl 60 --temp 0.7 --top-p 0.8 --top-k 20 --repeat_penalty 1.05 -n 65556 --port 8000 --host 0.0.0.0