What is the highest throughput anyone got with Gemma4 on CPU so far?
Posted by last_llm_standing@reddit | LocalLLaMA | View on Reddit | 15 comments
Wondering if there is any promising quant with high throughput and decent performance?
digamma6767@reddit
I'm not using CPU only, but I have been able to nearly double my tokens per second using speculative decoding.
Using bartowski 31B q6_k_l as my main model and bartowski 26B q6_k_l as my draft model. Getting a 60-70% acceptance rate and about 17 tokens per second (up from 9).
It feels like I'm using Qwen 3.5 122B in performance and intelligence, but with much less RAM usage.
Running on a 128GB Strix Halo.
digamma6767@reddit
Did some more testing on this. Doing agentic or coding work, the acceptance rate increases to 80-90%, and tokens per second go up to 17.
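For a sense of why acceptance rate matters so much here, a rough back-of-envelope model of speculative-decoding speedup (a simplified sketch in the style of the standard analysis; all parameter values below are illustrative, not measured from this setup, and it ignores batching effects that make real verification cheaper):

```python
def expected_speedup(accept_rate, draft_len, draft_cost_ratio):
    """Rough speculative-decoding speedup estimate.

    accept_rate: per-token probability the target model accepts a draft token (< 1)
    draft_len: number of draft tokens proposed per verification step
    draft_cost_ratio: cost of one draft-model step relative to one target step
    """
    a, k, c = accept_rate, draft_len, draft_cost_ratio
    # Expected tokens emitted per target verification pass
    # (geometric series, including the bonus token when all drafts pass).
    expected_tokens = (1 - a ** (k + 1)) / (1 - a)
    # Cost per pass: k sequential draft steps plus one target step.
    cost = k * c + 1
    return expected_tokens / cost

# Illustrative: high acceptance with a cheap draft pays off...
print(round(expected_speedup(0.9, 4, 0.1), 2))
# ...while a draft model nearly as expensive as the target can even hurt.
print(round(expected_speedup(0.7, 4, 0.5), 2))
```

Note that with a 26B draft for a 31B target the draft cost ratio is high, so this simple sequential-cost model understates the observed ~2x; in practice verification is a single batched pass and is cheaper than this assumes.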
Ok_Mammoth589@reddit
What command did you use?
digamma6767@reddit
The -md flag (short for --model-draft) in llama.cpp, to use the 26B as my draft model.
Effectively, it's loading both Gemma 4 31B and 26B at the same time. Works great if you can fit it into memory!
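For anyone wanting to try this, a minimal sketch of a llama-server launch with a draft model (the GGUF filenames and the --draft-min/--draft-max values are placeholders, not the commenter's exact settings):

```shell
# Target model via -m, draft model via -md (--model-draft).
# Both are loaded simultaneously, so both must fit in memory.
./llama-server \
  -m gemma4-31b-q6_k_l.gguf \
  -md gemma4-26b-q6_k_l.gguf \
  --draft-min 4 --draft-max 16 \
  -c 8192
```

Tuning --draft-min/--draft-max trades draft-model work against acceptance rate, so the sweet spot depends on the model pair.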
Ok_Mammoth589@reddit
I understand what switches are available. That's not what I'm after.
digamma6767@reddit
Are you asking about the benchmarks and tools I use? Not sure what else you're after.
For a rough estimate, I use the Aider polyglot benchmark. That gave me 17 tps consistently. It's a decent benchmark for seeing how quantization impacts the model.
When doing agent work (primarily Kilo Code and Hermes-Agent) I get anywhere from 13-16 tps, compared to 9 tps without the draft model.
Just chatting to the model, I get 12-14 tps.
I need to revisit all this stuff with the latest updates though. Lots happened in just the last few days for Gemma 4.
ikkiyikki@reddit
Not terribly useful without mentioning which model. Here's 31B on a Linux box with two 6000 Pros.
PS: not that impressed with any of the Gemma 4s, tbh.
ormandj@reddit
41 tok/s seems awfully low for two 6000s.
LegacyRemaster@reddit
I have an RTX 6000 96GB. Q6_K in LM Studio on a PC: about 47 tokens/sec. Minimax 2.5 Q4_K_XL: 78 tokens/sec. So... Minimax is better, for sure.
MelodicRecognition7@reddit
For dense models, the highest throughput you could theoretically get is your computer's memory bandwidth divided by the model size. For MoE models, it's memory bandwidth divided by the size (in GB) of the active parameters. Read this to get some basic understanding: https://old.reddit.com/r/LocalLLaMA/comments/1rqo2s0/can_i_run_this_model_on_my_hardware/?
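That bandwidth ceiling is a one-line calculation. A quick sketch (the bandwidth and quant-size numbers below are illustrative, not measured from any setup in this thread):

```python
def max_tps(bandwidth_gb_s, weights_gb):
    """Theoretical tokens/sec ceiling for a dense model: each generated
    token must stream every weight from memory once, so throughput is
    capped at memory bandwidth divided by the size of the weights."""
    return bandwidth_gb_s / weights_gb

# Example: ~256 GB/s memory bandwidth and a ~25 GB quantized dense model.
print(round(max_tps(256, 25), 1))  # → 10.2
```

For an MoE model, replace weights_gb with the size of just the active parameters, which is why MoE models are so much faster on bandwidth-limited hardware.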
last_llm_standing@reddit (OP)
Thanks for sharing!
lemondrops9@reddit
Gemma 4 26B Q4_K_M on a 3090 - 100 tk/s
last_llm_standing@reddit (OP)
sir this is a CPU
Betadoggo_@reddit
I know I'm nowhere near the fastest but I'll put my number here for reference:
On a Ryzen 5 3600 with 64GB of DDR4 running at 2933, I'm getting roughly 8-11 t/s within 8k context using the official q4_k_m 26BA4 from ggml-org, with the following arguments in llama-server:
--parallel 1 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 --models-preset config.ini
No idea if the speculative arguments are working with Gemma 4; they're there for other models.
last_llm_standing@reddit (OP)
What were your specs and what quant did you use?