Help with gibberish output on Qwen3.6-35B-A3B-GGUF::UD-IQ3_S
Posted by FiniteElemente@reddit | LocalLLaMA | View on Reddit | 15 comments
Hi, I'm trying to run Qwen3.6-35B-A3B-GGUF::UD-IQ3_S on my 5070 Ti with CUDA unified memory, but I'm getting gibberish as soon as some memory is offloaded to system RAM.
OS is Ubuntu and I compiled llama cpp myself.
export CUDA_HOME=/usr/local/cuda
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64
cd ~/projects/llama.cpp
rm -rf build
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=OFF -DGGML_CCACHE=OFF
cmake --build build --config Release -j $(nproc)
And here is my run command
Environment=GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
ExecStart=/home/llama.cpp/build/bin/llama-server \
-hf unsloth/Qwen3.6-35B-A3B-GGUF::UD-IQ3_S \
--host 0.0.0.0 --port 10232 \
--temp 0.7 \
--top-k 20 \
--top-p 0.8 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--parallel 1 \
--flash-attn on \
--fit on \
--fit-target 256 \
--fit-ctx 204800 \
--no-mmap \
--mlock \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--kv-offload \
-b 2048 -ub 2048 \
--reasoning-budget 4096 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--ctx-checkpoints 8 --sleep-idle-seconds 300
Could anyone help point out whether my build or run command is wrong? Thanks!
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
CircularSeasoning@reddit
> --cache-type-k q4_0 \
> --cache-type-v q4_0 \
Besides your quant being IQ3, which already risks quality loss, I think your cache-type-k/v is too low. Try q8_0 if you must, although I personally don't do that either. In my experiments with older Qwen models, setting cache-type-k/v to q8_0 introduced show-stopping degradation compared to the full k/v cache (just leave the setting off; the default is full).
This might not help with your issue, but more of a friendly tip. You can ignore if you already know this.
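For reference, the two alternatives suggested here look like this as flag changes against the OP's run command (a sketch; whether q8_0 is good enough depends on the model):

```shell
# Option 1: raise the KV cache quant from q4_0 to q8_0
--cache-type-k q8_0 \
--cache-type-v q8_0 \

# Option 2: drop both --cache-type flags entirely;
# llama-server then falls back to the default full-precision (f16) KV cache
```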
grumd@reddit
llama.cpp introduced KV cache rotations, and q8_0 is significantly higher quality now. You should try using it again; it's basically free real estate now.
Ok_Mammoth589@reddit
The quality hit at q4 is only slightly worse than the old hit at q8, which everyone said was essentially lossless even before rot attn was implemented.
grumd@reddit
Even before rot attn many people were saying that Qwen 3.5 is more sensitive than other models to kv cache quants and were recommending only using f16 or bf16. I myself tried running q8_0 (before rotattn) and after 50k of context it started making typos and forgetting file paths.
Now with rot attn I'm using q8_0 and didn't notice any issues so far.
CircularSeasoning@reddit
Now that is very interesting. There's my motivation!
qwen_next_gguf_when@reddit
Build with CUDA on and nothing else.
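That minimal build would look something like this (a sketch against a llama.cpp checkout; `-DGGML_CUDA=ON` is the only non-default flag):

```shell
cd ~/projects/llama.cpp
rm -rf build

# CUDA on, everything else left at defaults
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"
```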
Certain-Cod-1404@reddit
If I recall correctly CUDA 13.1 has issues, maybe try a different version?
Velocita84@reddit
Did you compile with cuda 13.2?
grumd@reddit
Try Q4_K_M instead, might help. You can even run Q6 quants with your GPU by offloading experts to RAM btw. Also I'd suggest using q8_0 for kv cache, q4_0 is too low quality. Also Qwen recommends top-p 0.95.
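Offloading experts to RAM can be sketched with llama.cpp's MoE/tensor-override flags (the model path here is a placeholder; check `llama-server --help` on your build, since these flags are relatively recent):

```shell
# Keep attention/dense layers on the GPU, push MoE expert tensors to CPU RAM
llama-server -m model.gguf \
  -ngl 99 \
  --n-cpu-moe 32
# or, equivalently, with a tensor-override regex:
#   -ot "blk\..*_exps\.=CPU"
```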
FiniteElemente@reddit (OP)
That worked, thx! I didn't realize smaller quants could result in gibberish. I've been troubleshooting my setup for hours!
grumd@reddit
Usually gibberish is some sort of driver/CUDA issue. CUDA 13.2, for example, results in gibberish with a lot of IQ quants. But you're using CUDA 13.1, so idk
FiniteElemente@reddit (OP)
Thank you. I will give it a try. Currently allocating 30 GB to my VM. Let's see if it's enough. ;)
CircularSeasoning@reddit
Great minds quant alike! Is it just me or is the common wisdom of "don't go lower than Q4_K_M" fading? Maybe we need a regular PSA on this.
grumd@reddit
Bigger models like 122b or 397b are the exception, they are still pretty good. 122b at IQ3_XXS is very usable for many tasks. But yeah for smaller models stick to Q4 or Q6
KringleKrispi@reddit
If you are on CUDA 13.2, downgrade to 13.1.