How much can you push an RTX 3090 in terms of tokens per second for Gemma4 E2B?
Posted by last_llm_standing@reddit | LocalLLaMA | View on Reddit | 14 comments
I'm trying to maximize the throughput. I can already get gemma-4-E2B-it-GGUF 8-bit to give me ~5 tokens per second on my Intel i9 CPU. How much can I push this if I get an RTX 3090?
If you are running on CPUs, how much TPS were you able to squeeze out for Gemma4 (any quant, any model)?
And on an RTX 3090, how far were you able to push it?
Stepfunction@reddit
If you're doing batch processing using vLLM, you'll be able to get several hundred t/s.
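The batch throughput here comes from vLLM's continuous batching: many concurrent requests share one forward pass, so aggregate t/s is far higher than single-stream decode. A minimal serving sketch, assuming a hypothetical GGUF-to-HF model id for the Gemma model discussed (the flags themselves are standard vLLM options):

```shell
# hypothetical model id; flags are standard vLLM server options
vllm serve google/gemma-4-e2b-it \
  --max-num-seqs 64 \            # how many requests batch together
  --gpu-memory-utilization 0.90 \ # leave a little VRAM headroom
  --max-model-len 8192            # cap context to fit KV cache on a 3090
```

Then send many requests in parallel against the OpenAI-compatible endpoint; per-request speed drops a little, but aggregate throughput climbs into the hundreds of t/s.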
last_llm_standing@reddit (OP)
on CPU?
Stepfunction@reddit
With a 3090
ambient_temp_xeno@reddit
I just tested gemma 4 31b q4_k_m on dual 3060 12gbs and it settled at about 14 t/s with llama.cpp. Only 16k context though.
For 31b q8 it was about 3 t/s with a Xeon and quad-channel DDR4.
YourNightmar31@reddit
... okay? He's asking about E2B though
x0wl@reddit
More than 100 tps. But with 3090, you won't need E2B, you'd be able to use 26B-A4B (also around 100 tps) or 31B at a reasonable 25-30 tps.
On my laptop 5090:
E2B - 190 tps
26B-A4B - 120 tps
31B - 30 tps
Also just for fun I ran the E2B on CPU (275HX) and got 25 tps, so you might be running it wrong on yours.
last_llm_standing@reddit (OP)
Wow, that's cool. I was running Q8_K_XL from unsloth on llama.cpp, since I don't have any local GPUs. Maybe that's why I'm getting low generation TPS? Also, honestly, I'm more curious about running on CPU and experimenting with the best quality+speed quant on CPU.
x0wl@reddit
How much RAM do you have? Ensure it doesn't swap
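The swap concern is easy to sanity-check with back-of-envelope math: a GGUF file is roughly parameter count times bits-per-weight, plus runtime buffers. A rough sketch, assuming E2B means about 2B effective parameters (a guess from the name) and typical GGUF bit widths:

```python
def gguf_size_gb(n_params: float, bits_per_weight: float, overhead_gb: float = 1.0) -> float:
    """Rough memory footprint: weights plus a flat allowance for KV cache/runtime buffers."""
    return n_params * bits_per_weight / 8 / 1e9 + overhead_gb

# Assumed: ~2B params; Q8_0 is ~8.5 bits/weight, Q4_K_M ~4.8 bits/weight
q8_gb = gguf_size_gb(2e9, 8.5)  # roughly 3 GB total
q4_gb = gguf_size_gb(2e9, 4.8)  # roughly 2 GB total
```

Either fits comfortably in 16 GB on its own, so if you see swapping it's likely other processes plus a large context; check actual resident size with your OS tools rather than trusting the estimate.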
last_llm_standing@reddit (OP)
16GB, i am planning to buy addtional, can upgrade to 32
TheMasterOogway@reddit
Do a Q4, it'll be much faster on CPU.
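The reason a smaller quant helps on CPU: single-stream decode is memory-bandwidth bound, since every generated token streams the whole weight file through RAM once. A rough ceiling, with illustrative (not measured) numbers:

```python
def decode_tps_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on single-stream decode speed: tokens/s ~= bandwidth / bytes per token."""
    return bandwidth_gb_s / model_gb

# Hypothetical: dual-channel DDR4 at ~40 GB/s, a ~2B-param model
tps_q8 = decode_tps_ceiling(40, 2.0)  # Q8 weights (~2 GB): ~20 t/s ceiling
tps_q4 = decode_tps_ceiling(40, 1.2)  # Q4 weights (~1.2 GB): ~33 t/s ceiling
```

Halving the weight bytes roughly doubles the ceiling, which is why Q4 on CPU often beats Q8 by a wide margin at a modest quality cost.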
qwen_next_gguf_when@reddit
With a 4090, I get 12k t/s prompt processing and 257 t/s generation.
GeneralEnverPasa@reddit
gemma-4-E2B-it-GGUF Q4_K_M, KV cache Q8, context 4096: 150 t/s
gemma-4-E2B-it-GGUF Q4_K_M, KV cache Q8, context 131072: 150 t/s
gemma-4-E2B-it-GGUF Q4_K_M, KV cache Q4, context 4096: 150 t/s
gemma-4-E2B-it-GGUF Q4_K_M, KV cache Q4, context 131072: 150 t/s
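These KV-cache-quant numbers can be reproduced with llama.cpp's benchmark tool; a sketch assuming a local GGUF path (the flags are standard llama.cpp options, and a quantized V cache requires flash attention):

```shell
# hypothetical model path; -ctk/-ctv quantize the KV cache, -fa enables flash attention
./llama-bench -m gemma-4-E2B-it-Q4_K_M.gguf \
  -ctk q8_0 -ctv q8_0 -fa 1 \
  -p 512 -n 128
```

Swap `q8_0` for `q4_0` to test the Q4 cache rows; the identical 150 t/s across context sizes suggests the benchmark was compute-bound on the weights, not the cache.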
last_llm_standing@reddit (OP)
My i9 CPU is Cascade Lake and supports AVX-512 VNNI.
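Whether llama.cpp actually uses those instructions depends on what the CPU reports; on Linux you can check the kernel's flag list directly. A small sketch (the `avx512_vnni` flag name is what `/proc/cpuinfo` uses on Linux; on other OSes this file doesn't exist and the check returns False):

```python
def has_cpu_flag(flag: str) -> bool:
    """Return True if /proc/cpuinfo lists the given CPU feature flag (Linux only)."""
    try:
        with open("/proc/cpuinfo") as f:
            return any(flag in line for line in f if line.startswith("flags"))
    except OSError:
        return False

# e.g. has_cpu_flag("avx512_vnni"), has_cpu_flag("avx512f")
```

llama.cpp also prints the instruction sets it was compiled with at startup, which is worth cross-checking against this.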