How much can you push RTX3090 in terms of Tokens Per Second for Gemma4 E2B?

Posted by last_llm_standing@reddit | LocalLLaMA | View on Reddit | 14 comments

I'm trying to maximize the throuhgput, I can already get gemma-4-E2B-it-GGUF 8bit to give me \~5 tokens per second on my intel i9 cpu. How much can i push this if I get an RTX3090 rtx.

If you are running on CPUs, how much TPS were you able to squish out for Gemma4 (any quant, any model)?

And on RTX3090, how much were you able to push the boundaries?

[-]

Stepfunction@reddit

If you're doing batch processing using vLLM, you'll be able to get several hundred t/s.

[-]

ambient_temp_xeno@reddit

I just tested gemma 4 31b q4_k_m on dual 3060 12gbs and it settled at about 14 t/s with llama.cpp. Only 16k context though.

For 31b q8 it was about 3 t/s with xeon and quad channel ddr4.

[-]

YourNightmar31@reddit

... okay? He's asking about E2B though

[-]

ambient_temp_xeno@reddit

If you are running on CPUs, how much TPS were you able to squish out for Gemma4 (any quant, any model)?

[-]

x0wl@reddit

More than 100 tps. But with 3090, you won't need E2B, you'd be able to use 26B-A4B (also around 100 tps) or 31B at a reasonable 25-30 tps.

On my laptop 5090:
E2B - 190 tps
A26B-A4B - 120 tps
31B - 30 tps

Also just for fun I ran the E2B on CPU (275HX) and got 25 tps, so might be running it wrong on yours.

[-]

wow thats cool, i was running Q8_K_XL from unsloth on llamacpp, since I don't have any local gpus. Maybe thats why im getting low generation tps? also honestly Im more curious running on CPU and experimenting with best quality+speed quant on CPU.

[-]

x0wl@reddit

How much RAM do you have? Ensure it doesn't swap

[-]

last_llm_standing@reddit (OP)

16GB, i am planning to buy addtional, can upgrade to 32

[-]

last_llm_standing@reddit (OP)

My i9 cpu is cascade lake and supports AVX 512 VNNI