Gemma 4 E4B on RTX 5070 Ti laptop 12GB running slow, 5 t/s, llama.cpp
Posted by Plastic-Parsley3094@reddit | LocalLLaMA | View on Reddit | 1 comments
I sincerely hope someone can help me, because I have tried everything I can and I still get this speed using llama.cpp and opencode. I have put my setup and how I am running it in as much detail as I can. I hope someone can help me, as it's been 1 week, non-stop, 8 hours a day, and nothing. I have tested other quants and so on, but nothing gives me better speeds.
prompt eval time: 539.91 tokens per second
eval time: 5.05 tokens per second
I can see about 2 words coming up per second, maybe more, but it feels super slow, and here I read about people getting much, much faster speeds even with the 24B model and 12 GB VRAM. So if anyone could help me with how to run llama.cpp with Gemma E4B or Gemma 26B, it would make my day.
Hardware : Lenovo legion pro i5
CPU: Intel(R) Core(TM) Ultra 9 275HX (24) @ 5.40 GHz
GPU 1: NVIDIA GeForce RTX 5070 Ti Mobile 12GB VRAM [Discrete]
GPU 2: Intel Graphics [Integrated]
Memory: 32 GB
OS: Linux Arch (CachyOS)
I have installed llama.cpp-cuda-git and have also tried vLLM in Docker, as I can't get it to work in a pip env on my laptop.
Logs from llama-server:
prompt eval time = 948.31 ms / 512 tokens (1.85 ms per token, 539.91 tokens per second)
eval time = 66100.04 ms / 334 tokens (197.90 ms per token, 5.05 tokens per second)
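For what it's worth, the two log lines above are internally consistent; a quick back-of-the-envelope check in plain Python, just redoing the arithmetic from the log:

```python
# Numbers taken directly from the llama-server timing log above.
prompt_ms, prompt_tokens = 948.31, 512
eval_ms, eval_tokens = 66100.04, 334

# tokens per second = tokens / (milliseconds / 1000)
prompt_tps = prompt_tokens / (prompt_ms / 1000)
eval_tps = eval_tokens / (eval_ms / 1000)

print(f"prompt eval: {prompt_tps:.2f} t/s")  # ~539.91 t/s
print(f"generation:  {eval_tps:.2f} t/s")    # ~5.05 t/s
```

The fast prompt eval next to the very slow generation is the pattern the comment below is pointing at.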
How I run my model, even this small Gemma 4 E4B:
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M \
--n-gpu-layers 999 \
--port 8089 \
--ctx-size 16384 \
--parallel 1 \
--threads 1 \
--batch-size 1024 \
--ubatch-size 1024 \
--flash-attn on \
--mlock \
--no-mmap \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--no-mmproj
# --ctx-size: have tried less without any difference
# --threads: changed this, did not see much change
# --batch-size / --ubatch-size: changed these a lot; lower gives better results (9 t/s)
# --no-mmproj: I think this disables audio/vision, no need for coding
(Note: the comments are moved below the command because a `#` after a trailing `\` breaks the line continuation.)
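Before tuning flags further, it's worth confirming all layers actually landed on the GPU. llama-server prints an offload summary at startup; a minimal sketch of checking it (the sample log line in the heredoc is illustrative, not from my actual run — pipe your real server log instead):

```shell
# Write an example startup log line to a file (assumption: your llama.cpp
# build prints an "offloaded N/M layers to GPU" summary like this one).
cat > /tmp/llama-server.log <<'EOF'
load_tensors: offloaded 35/36 layers to GPU
EOF

# Extract the offload count. If the two numbers differ, some layers
# fell back to the CPU despite --n-gpu-layers 999.
grep -o 'offloaded [0-9]*/[0-9]* layers to GPU' /tmp/llama-server.log
```

Watching `nvidia-smi` while the model generates is another quick way to see whether the GPU is doing the work.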
My opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"ollama": {
"npm": "@ai-sdk/openai-compatible",
"name": "llama-server (local)",
"options": {
"baseURL": "http://127.0.0.1:8089/v1",
"headers": {
"Authorization": "Bearer any-key"
}
},
"models": {
"gemma4": {
"name": "Gemma 4 E4B",
"limit": {
"context": 16384,
"output": 4096
},
"extraBody": {
"think": true,
// "reasoning_effort": "none",
"stop": ["<turn|>", "<end_of_turn>", "<eos>"]
}
},
"gemma4-fast": {
"name": "Gemma 4 E4B (Fast)",
"limit": {
"context": 16384,
"output": 4096
},
"extraBody": {
"think": true,
"stop": ["<turn|>", "<end_of_turn>", "<eos>"]
}
}
}
}
},
"model": "ollama/gemma4-fast"
}
Kodix@reddit
The issue isn't the specific settings. The issue is that the model is being run off the CPU instead of your GPU. Whatever your model-specific settings are, you will see a *huge* boost in speed once you manage to launch it on your GPU. My toaster could do better than 5 t/s on Gemma E4B.
I'm not clear on *why*; nothing here stands out to me. But my first step would be switching off of ollama and onto llama.cpp, for example.