Help with gibberish output on Qwen3.6-35B-A3B-GGUF::UD-IQ3_S

Posted by FiniteElemente@reddit | LocalLLaMA | 15 comments

Hi, I'm trying to run Qwen3.6-35B-A3B-GGUF::UD-IQ3_S on my 5070 Ti with CUDA unified memory, but I get gibberish as soon as some of the model is offloaded to system RAM.

OS is Ubuntu, and I compiled llama.cpp myself:

export CUDA_HOME=/usr/local/cuda
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64

cd ~/projects/llama.cpp
rm -rf build

export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=OFF -DGGML_CCACHE=OFF

cmake --build build --config Release -j $(nproc)
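For what it's worth, here is how I sanity-check the resulting binary (the `ldd` grep is just my way of confirming the server actually linked against the CUDA libraries; the paths assume the repo lives at `~/projects/llama.cpp`):

```shell
cd ~/projects/llama.cpp

# the server binary should link against libcuda / libcublas if
# -DGGML_CUDA=ON actually took effect during configuration
ldd build/bin/llama-server | grep -i cuda

# quick version/feature check; the startup banner lists the enabled backends
build/bin/llama-server --version
```

My understanding is that GGML_CUDA_ENABLE_UNIFIED_MEMORY is read at runtime rather than at compile time, so exporting it before cmake shouldn't matter either way; I set it in the systemd unit too.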

And here is my run command

Environment=GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
ExecStart=/home/llama.cpp/build/bin/llama-server \
  -hf unsloth/Qwen3.6-35B-A3B-GGUF::UD-IQ3_S \
  --host 0.0.0.0 --port 10232 \
  --temp 0.7 \
  --top-k 20 \
  --top-p 0.8 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0 \
  --parallel 1 \
  --flash-attn on \
  --fit on \
  --fit-target 256 \
  --fit-ctx 204800 \
  --no-mmap \
  --mlock \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --kv-offload \
  -b 2048 -ub 2048 \
  --reasoning-budget 4096 \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --ctx-checkpoints 8 --sleep-idle-seconds 300
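To bisect the problem, I was planning to try a stripped-down launch first, keeping only the flags I'm certain about and dropping the quantized KV cache and the fit/offload settings (the `-ngl 20` layer count below is a guess for the 5070 Ti's 16 GB; I'd adjust it until it fits):

```shell
# minimal invocation to test whether the q4_0 KV cache or the
# offload settings are what produces the gibberish;
# -ngl controls how many layers stay on the GPU
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 /home/llama.cpp/build/bin/llama-server \
  -hf unsloth/Qwen3.6-35B-A3B-GGUF::UD-IQ3_S \
  --host 0.0.0.0 --port 10232 \
  -c 8192 -ngl 20 \
  --temp 0.7 --top-k 20 --top-p 0.8
```

If this produces clean output, I'd re-add the remaining flags one at a time until it breaks.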

Could anyone help point out whether my build or run command is wrong? Thanks!

nvidia-smi reports: NVIDIA-SMI 590.48.01, Driver Version 590.48.01, CUDA Version 13.1.