Gemma 4 E4B on RTX 5070 Ti laptop 12GB running slow, 5 t/s, llama.cpp
Posted by Plastic-Parsley3094@reddit | LocalLLaMA | View on Reddit | 1 comments
I sincerely hope someone can help me, because I have tried everything I can and I still get this speed using llama.cpp and opencode. I have put my setup and how I am running it in as much detail as I can. I hope someone can help me, as it's been 1 week, non-stop, 8 hours a day, and nothing. I have tested other quants and so on, but nothing gives me better speeds.
prompt eval time: 539.91 tokens per second
eval time: 5.05 tokens per second
I can see about 2 words coming up per second, maybe more, but it feels super slow, and here I read about people getting much, much faster speeds even with the 24B model and 12 GB VRAM. So if anyone could help me with how to run llama.cpp with Gemma E4B or Gemma 26B, it would make my day.
Hardware : Lenovo legion pro i5
CPU: Intel(R) Core(TM) Ultra 9 275HX (24) @ 5.40 GHz
GPU 1: NVIDIA GeForce RTX 5070 Ti Mobile 12GB VRAM [Discrete]
GPU 2: Intel Graphics [Integrated]
Memory: 32 GB
OS: Linux Arch (CachyOS)
I have installed llama.cpp-cuda-git and have also tried vLLM in Docker, as I can't get it to work in a pip env on my laptop.
Logs from llama-server:
prompt eval time = 948.31 ms / 512 tokens (1.85 ms per token, 539.91 tokens per second)
eval time = 66100.04 ms / 334 tokens (197.90 ms per token, 5.05 tokens per second)
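For what it's worth, the two log lines above are internally consistent; a quick back-of-the-envelope check in plain Python, just redoing the arithmetic from the log:

```python
# Numbers taken directly from the llama-server timing log above.
prompt_ms, prompt_tokens = 948.31, 512
eval_ms, eval_tokens = 66100.04, 334

# tokens per second = tokens / (milliseconds / 1000)
prompt_tps = prompt_tokens / (prompt_ms / 1000)
eval_tps = eval_tokens / (eval_ms / 1000)

print(f"prompt eval: {prompt_tps:.2f} t/s")  # ~539.91 t/s
print(f"generation:  {eval_tps:.2f} t/s")    # ~5.05 t/s
```

The fast prompt eval next to the very slow generation is the pattern the comment below is pointing at.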
How I run my model, even this small Gemma 4 E4B:
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M \
--n-gpu-layers 999 \
--port 8089 \
--ctx-size 16384 \
--parallel 1 \
--threads 1 \
--batch-size 1024 \
--ubatch-size 1024 \
--flash-attn on \
--mlock \
--no-mmap \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--no-mmproj
# --ctx-size: have tried less without any difference
# --threads: changed this, did not see much change
# --batch-size / --ubatch-size: changed these a lot; lower gives better results (9 t/s)
# --no-mmproj: I think this disables audio/vision, no need for coding
(Note: the comments are moved below the command because a `#` after a trailing `\` breaks the line continuation.)
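Before tuning flags further, it's worth confirming all layers actually landed on the GPU. llama-server prints an offload summary at startup; a minimal sketch of checking it (the sample log line in the heredoc is illustrative, not from my actual run — pipe your real server log instead):

```shell
# Write an example startup log line to a file (assumption: your llama.cpp
# build prints an "offloaded N/M layers to GPU" summary like this one).
cat > /tmp/llama-server.log <<'EOF'
load_tensors: offloaded 35/36 layers to GPU
EOF

# Extract the offload count. If the two numbers differ, some layers
# fell back to the CPU despite --n-gpu-layers 999.
grep -o 'offloaded [0-9]*/[0-9]* layers to GPU' /tmp/llama-server.log
```

Watching `nvidia-smi` while the model generates is another quick way to see whether the GPU is doing the work.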
My opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"ollama": {
"npm": "@ai-sdk/openai-compatible",
"name": "llama-server (local)",
"options": {
"baseURL": "http://127.0.0.1:8089/v1",
"headers": {
"Authorization": "Bearer any-key"
}
},
"models": {
"gemma4": {
"name": "Gemma 4 E4B",
"limit": {
"context": 16384,
"output": 4096
},
"extraBody": {
"think": true,
// "reasoning_effort": "none",
"stop": ["<turn|>", "<end_of_turn>", "<eos>"]
}
},
"gemma4-fast": {
"name": "Gemma 4 E4B (Fast)",
"limit": {
"context": 16384,
"output": 4096
},
"extraBody": {
"think": true,
"stop": ["<turn|>", "<end_of_turn>", "<eos>"]
}
}
}
}
},
"model": "ollama/gemma4-fast"
}
Kodix@reddit
The issue isn't the specific settings. The issue is that the model is being run off the CPU instead of your GPU. Whatever your model-specific settings are, you will see a *huge* boost in speed once you manage to launch it on your GPU. My toaster could do better than 5 t/s on Gemma E4B.
I'm not clear on *why*; nothing here stands out to me. But my first step would be switching off of ollama and onto llama.cpp, for example.