My settings for running Gemma 4 31B smoothly on llama.cpp, CUDA 13.1
Posted by Oatilis@reddit | LocalLLaMA | 3 comments
I've had some issues running Gemma 4 31B with llama.cpp, even after updating the model weights, pulling the latest codebase, and recompiling everything. I ran into several bugs and troubleshot them one by one until I could finally run long-running autonomous tasks reliably.
Hope someone finds this helpful.
The Setup:
Hardware: RTX 6000 Pro 96GB, CUDA 13.1, 128GB RAM (DDR5)
Model: Gemma 4 31B Unsloth GGUF BF16, from April 10th (this is the re-upload).
| gguf | md5 |
|---|---|
| gemma-4-31B-it-BF16-00001-of-00002.gguf | 6e89e147c3cc8bd39179b401c6321a08 |
| gemma-4-31B-it-BF16-00002-of-00002.gguf | e9a4eb9f09956145b8139f302a49cf93 |
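Before a long load it's worth confirming the shards match the sums above. A minimal sketch with `md5sum -c`, assuming the GGUF files sit in the current directory (adjust paths to wherever you downloaded them):

```shell
# Write the expected sums from the table above into a checksum file.
cat > gemma4-checksums.md5 <<'EOF'
6e89e147c3cc8bd39179b401c6321a08  gemma-4-31B-it-BF16-00001-of-00002.gguf
e9a4eb9f09956145b8139f302a49cf93  gemma-4-31B-it-BF16-00002-of-00002.gguf
EOF
# --ignore-missing skips shards you haven't downloaded yet.
md5sum -c --ignore-missing gemma4-checksums.md5 \
  || echo "checksum mismatch (or no shards present)"
```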
llama.cpp commit: d132f22fc92f36848f7ccf2fc9987cd0b0120825
My launch script:
#!/bin/bash
export GGML_CUDA_NO_VMM=1
llama-server \
  --model /gemma-4-31B-it/BF16/gemma-4-31B-it-BF16-00001-of-00002.gguf \
  --chat-template-file /models/templates/google-gemma-4-31B-it-interleaved.jinja \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64 \
  --no-webui \
  --no-mmap \
  --parallel 1 \
  --ctx-size 65576 \
  --flash-attn off
Here's the reasoning behind some of the settings.
These are the sampling parameters recommended by Google:
--temp 1.0 \
--top-p 0.95 \
--top-k 64 \
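If you hit the server through its OpenAI-compatible API instead of relying on the launch defaults, the same sampling values can be set per request. A sketch, assuming the default port 8080 and llama-server's acceptance of `top_k` as an extra body field (both assumptions on my part, not from the post):

```shell
# Build a request body with the same sampling settings as the launch flags.
cat > request.json <<'EOF'
{
  "messages": [{"role": "user", "content": "Hello"}],
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 64
}
EOF
# Assumes llama-server is listening on its default port 8080.
command -v curl >/dev/null 2>&1 \
  && curl -s http://localhost:8080/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d @request.json \
  || echo "server not reachable (or curl missing)"
```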
This one took a lot of trial and error. Apparently there are bugs in llama.cpp where memory-mapped model weights are not always freed from RAM, so memory that looked free triggered OOM crashes at runtime:
--no-mmap \
--parallel 1 \
--ctx-size 65576 \
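A quick way to see whether the mmap issue bites on your machine is to watch host and GPU memory before and after the server finishes loading. A sketch (the `/proc/meminfo` fields are standard Linux; the `nvidia-smi` query flags are the usual ones, but verify against your driver version):

```shell
# Host RAM: MemAvailable should drop by roughly the model size during load
# and recover after shutdown. With --no-mmap the weights live in ordinary
# allocations rather than page-cache-backed mappings.
grep -E 'MemTotal|MemAvailable' /proc/meminfo
# GPU side, if nvidia-smi is on PATH:
command -v nvidia-smi >/dev/null 2>&1 \
  && nvidia-smi --query-gpu=memory.used,memory.total --format=csv \
  || echo "nvidia-smi not available"
```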
Apparently there is a bug in the llama.cpp CUDA implementation where the FA kernel fails to synchronize properly when the context is too large:
--flash-attn off
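To check whether the FA issue reproduces on a given build, llama-bench can A/B the two attention paths in one run. A sketch using the model path from the script above (`-fa 0,1` is llama-bench's usual comma-separated sweep syntax, but verify against your llama.cpp version):

```shell
# Hypothetical A/B benchmark: prompt processing (-p) and generation (-n)
# with flash attention off (0) and on (1) in a single sweep.
MODEL=/gemma-4-31B-it/BF16/gemma-4-31B-it-BF16-00001-of-00002.gguf
command -v llama-bench >/dev/null 2>&1 \
  && llama-bench -m "$MODEL" -fa 0,1 -p 512 -n 128 \
  || echo "llama-bench not on PATH"
```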
These are just for my use case:
--parallel 1 \
--ctx-size 65576 \
--no-webui \
For some tasks I also pass --reasoning-off to save time.
So that's it: with these settings I've got Gemma 4 running reliably at 64K context length. When I get the chance, I'll try TurboQuant to see if I can fit even more context.
lakySK@reddit
If you have enough memory to run BF16, why use the Unsloth upload instead of the official release?
330d@reddit
What's pretty well? pp/tg?
Oatilis@reddit (OP)
It means I finally reached a point where llama.cpp serves the model reliably and doesn't crash on long tasks. I haven't benchmarked performance, but I may look into it later.