Help with gibberish output on Qwen3.6-35B-A3B-GGUF::UD-IQ3_S
Posted by FiniteElemente@reddit | LocalLLaMA | View on Reddit | 15 comments
Hi, I'm trying to run Qwen3.6-35B-A3B-GGUF::UD-IQ3_S on my 5070 Ti with CUDA unified memory, but I'm getting gibberish as soon as some memory is offloaded to system RAM.
OS is Ubuntu and I compiled llama cpp myself.
export CUDA_HOME=/usr/local/cuda
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64
cd ~/projects/llama.cpp
rm -rf build
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=OFF -DGGML_CCACHE=OFF
cmake --build build --config Release -j $(nproc)
And here is my run command
Environment=GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
ExecStart=/home/llama.cpp/build/bin/llama-server \
-hf unsloth/Qwen3.6-35B-A3B-GGUF::UD-IQ3_S \
--host 0.0.0.0 --port 10232 \
--temp 0.7 \
--top-k 20 \
--top-p 0.8 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--parallel 1 \
--flash-attn on \
--fit on \
--fit-target 256 \
--fit-ctx 204800 \
--no-mmap \
--mlock \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--kv-offload \
-b 2048 -ub 2048 \
--reasoning-budget 4096 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--ctx-checkpoints 8 --sleep-idle-seconds 300
Could anyone help point out whether my build or run command is wrong? Thanks!
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
CircularSeasoning@reddit
> --cache-type-k q4_0 \
> --cache-type-v q4_0 \
Besides your quant being IQ3, which already risks quality loss, I think your cache-type-k/v is too low. Try q8_0 if you must, although I personally don't do that either. In my experiments with older Qwen models, setting cache-type-k/v to q8_0 introduced show-stopping degradation compared to the full k/v cache (just leave the setting off; the default is full).
This might not help with your issue, but more of a friendly tip. You can ignore if you already know this.
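For reference, the two alternatives suggested here look like this as flag changes against the OP's run command (a sketch; whether q8_0 is good enough depends on the model):

```shell
# Option 1: raise the KV cache quant from q4_0 to q8_0
--cache-type-k q8_0 \
--cache-type-v q8_0 \

# Option 2: drop both --cache-type flags entirely;
# llama-server then falls back to the default full-precision (f16) KV cache
```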
grumd@reddit
llama.cpp introduced KV cache rotations, and q8_0 is significantly higher quality now. You should try using it again; it's basically free real estate now.
Ok_Mammoth589@reddit
The quality hit at q4 is only slightly worse than the old hit at q8, which everyone said was essentially lossless even before rot attn was implemented.
grumd@reddit
Even before rot attn many people were saying that Qwen 3.5 is more sensitive than other models to kv cache quants and were recommending only using f16 or bf16. I myself tried running q8_0 (before rotattn) and after 50k of context it started making typos and forgetting file paths.
Now with rot attn I'm using q8_0 and didn't notice any issues so far.
CircularSeasoning@reddit
Now that is very interesting. There's my motivation!
qwen_next_gguf_when@reddit
Build with CUDA on and nothing else.
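That minimal build would look something like this (a sketch against a llama.cpp checkout; `-DGGML_CUDA=ON` is the only non-default flag):

```shell
cd ~/projects/llama.cpp
rm -rf build

# CUDA on, everything else left at defaults
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"
```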
Certain-Cod-1404@reddit
If I recall correctly CUDA 13.1 has issues, maybe try a different version?
Velocita84@reddit
Did you compile with cuda 13.2?
grumd@reddit
Try Q4_K_M instead, might help. You can even run Q6 quants with your GPU by offloading experts to RAM btw. Also I'd suggest using q8_0 for kv cache, q4_0 is too low quality. Also Qwen recommends top-p 0.95.
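Offloading experts to RAM can be sketched with llama.cpp's MoE/tensor-override flags (the model path here is a placeholder; check `llama-server --help` on your build, since these flags are relatively recent):

```shell
# Keep attention/dense layers on the GPU, push MoE expert tensors to CPU RAM
llama-server -m model.gguf \
  -ngl 99 \
  --n-cpu-moe 32
# or, equivalently, with a tensor-override regex:
#   -ot "blk\..*_exps\.=CPU"
```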
FiniteElemente@reddit (OP)
That worked, thx! I didn't realize smaller quants could result in gibberish. I've been troubleshooting my setup for hours!
grumd@reddit
Usually gibberish is some sort of driver/CUDA issue. CUDA 13.2, for example, results in gibberish with a lot of IQ quants. But you're using CUDA 13.1, so idk
FiniteElemente@reddit (OP)
Thank you. I will give it a try. Currently allocating 30 GB to my VM. Let's see if it's enough. ;)
CircularSeasoning@reddit
Great minds quant alike! Is it just me or is the common wisdom of "don't go lower than Q4_K_M" fading? Maybe we need a regular PSA on this.
grumd@reddit
Bigger models like 122b or 397b are the exception, they are still pretty good. 122b at IQ3_XXS is very usable for many tasks. But yeah for smaller models stick to Q4 or Q6
KringleKrispi@reddit
If you are on CUDA 13.2, downgrade to 13.1.