Is anyone able to successfully run Qwen 30B Coder BF16?
Posted by TokenRingAI@reddit | LocalLLaMA | View on Reddit | 20 comments
With llama.cpp and the Unsloth GGUFs for Qwen 30B Coder BF16, I am getting frequent crashes on two entirely different systems: a Ryzen AI Max and an RTX 6000 Blackwell.
Llama.cpp just exits with no error message after a few messages.
vLLM works perfectly on the Blackwell with the official model from Qwen, except tool calling is currently broken, even with the new Qwen3 tool call parser that vLLM added, so the tool call instructions just end up in the chat stream.
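A sketch of that kind of vLLM launch, for reference (the model id, context length, and port here are illustrative rather than my exact invocation):

    vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
        --enable-auto-tool-choice \
        --tool-call-parser qwen3_coder \
        --max-model-len 65536 \
        --port 8000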
enonrick@reddit
No problems with my setup, an RTX 8000 + RTX A6000, with llama.cpp (8ff2060).
TokenRingAI@reddit (OP)
Here's my docker compose file:
After the latest update to the server-cuda container, which came out at midnight, I am now getting this error on the Blackwell, whereas before it just exited with no error message or dmesg trap:
enonrick@reddit
Can't help much since I don't use Docker. It looks like either llama.cpp or the CUDA image has a compatibility problem or an ABI conflict with the kernel. Try building a fresh llama.cpp.
TokenRingAI@reddit (OP)
Set the context length shorter, and now I'm getting this on the Blackwell:
Secure_Reflection409@reddit
Is 30b-coder actually of 2507 ilk? It feels worse.
DistanceSolar1449@reddit
-ts 1,1 is lol
At 256k tokens max context you need only 70GB. You’re better off with -ts 1,2 or -ts 2,1 to fill the A6000
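Roughly like this (model path and context size are placeholders, not a tested command):

    llama-server -m Qwen3-Coder-30B-A3B-Instruct-BF16.gguf -c 262144 -ngl 99 -ts 2,1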
TokenRingAI@reddit (OP)
Is that the unsloth GGUF?
enonrick@reddit
Yes, from unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
Secure_Reflection409@reddit
How are you doing tools? Is it via Roo? A chap posted a Roo-specific fix which finally allowed 30B Coder to work consistently for me.
TokenRingAI@reddit (OP)
No, through the OpenAI-compatible tool API.
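I.e. plain chat completion requests with a tools array, along these lines (the endpoint, model id, and the read_file tool here are just illustrative):

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
        "messages": [{"role": "user", "content": "Read the file src/main.py"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "read_file",
            "description": "Read a file from the workspace",
            "parameters": {
              "type": "object",
              "properties": {"path": {"type": "string"}},
              "required": ["path"]
            }
          }
        }]
      }'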
DeltaSqueezer@reddit
What parameters are you using to start vLLM? Tool calling works fine for me.
TokenRingAI@reddit (OP)
DeltaSqueezer@reddit
you might want to try the hermes tool call parser instead of qwen3_coder
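Something like this, assuming the rest of your serve command stays the same:

    vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct --enable-auto-tool-choice --tool-call-parser hermes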
TokenRingAI@reddit (OP)
It just dumps the tool calls into the chat stream with either tool call parser
TokenRingAI@reddit (OP)
Here is the bug:
https://github.com/vllm-project/vllm/issues/22975
complead@reddit
It might help to check if the crashes are related to memory limits on your systems. Llama.cpp can be memory-heavy, so try lowering the context size. Also, ensure you're using the latest version of llama.cpp as there might be bug fixes or optimizations that address these issues. Another angle is testing with different configuration flags to see if specific settings are causing the issue.
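For example, a stripped-down launch with a smaller context window to rule out memory pressure (model path and values are placeholders):

    llama-server -m Qwen3-Coder-30B-A3B-Instruct-BF16.gguf -c 16384 -ngl 99 --port 8000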
Marksta@reddit
Found a Qwen 30B right here in the comments!
NNN_Throwaway2@reddit
I run it through LMStudio on a 7900XTX and 7900X with 96GB RAM. I have not used the tool-calling capabilities, however.
RagingAnemone@reddit
I am on my Mac:

    llama-server --jinja -m models/Qwen3-Coder-30B-A3B-Instruct-1M-BF16.gguf -c 32768 -ngl 60 --temp 0.7 --top-p 0.8 --top-k 20 --repeat_penalty 1.05 -n 65556 --port 8000 --host 0.0.0.0