Gemma 4 E4B on RTX 5070 Ti laptop (12 GB) running slow at 5 t/s with llama.cpp
Posted by Plastic-Parsley3094@reddit | LocalLLaMA | View on Reddit | 11 comments
I sincerely hope someone can help me, because I have tried everything I can and I still get this speed using
llama.cpp and opencode. I have described my setup and how I am running it in as much detail as I can. I hope someone can help me, as it's been 1 week non-stop, 8 hours a day, and nothing. I have tested other quants and so on, but nothing gives me better speeds.
prompt eval time: 539.91 tokens per second
eval time: 5.05 tokens per second
I can see about 2 words coming up per second, maybe more, but it feels super slow, and here I read about people getting much, much faster speeds even with the 24B model and 12 GB VRAM. So if anyone could help me run llama.cpp with Gemma E4B or Gemma 26B, it would make my day.
Hardware: Lenovo Legion Pro i5
CPU: Intel(R) Core(TM) Ultra 9 275HX (24) @ 5.40 GHz
GPU 1: NVIDIA GeForce RTX 5070 Ti Mobile 12GB VRAM [Discrete]
GPU 2: Intel Graphics [Integrated]
Memory: 32 GB
OS: Arch Linux (CachyOS)
I have installed llama.cpp-cuda-git and have also tried vLLM in Docker, as I can't get it to work in a pip env on my laptop.
Logs from llama-server:
prompt eval time = 948.31 ms / 512 tokens (1.85 ms per token, 539.91 tokens per second)
eval time = 66100.04 ms / 334 tokens (197.90 ms per token, 5.05 tokens per second)
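The two log lines are internally consistent; here is a quick sanity check of the arithmetic (plain Python, no llama.cpp needed), just to show the numbers really say prefill is fast while decode is slow:

```python
# Recompute the throughput figures from the llama-server log above.
def tokens_per_second(total_ms: float, n_tokens: int) -> float:
    """Convert 'total_ms spent on n_tokens' into tokens/second."""
    return n_tokens / (total_ms / 1000.0)

prompt_tps = tokens_per_second(948.31, 512)    # prompt eval (prefill)
decode_tps = tokens_per_second(66100.04, 334)  # eval (decode)

print(f"prefill: {prompt_tps:.2f} t/s, decode: {decode_tps:.2f} t/s")
```

A ~540 t/s prefill next to a ~5 t/s decode often points to generation being bottlenecked somewhere other than GPU compute (layers left on the CPU, system RAM bandwidth, or a power-throttled card), since the prefill clearly runs fast.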
How I run my model, even this small Gemma 4 E4B:
# notes: lowering --ctx-size made no difference; changing --threads didn't change much;
# lowering --batch-size/--ubatch-size gave better results (9 t/s)
llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q4_K_M \
--n-gpu-layers 999 \
--port 8089 \
--ctx-size 16384 \
--parallel 1 \
--threads 1 \
--batch-size 1024 \
--ubatch-size 1024 \
--flash-attn on \
--mlock \
--no-mmap \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
--no-mmproj # I think this disables audio/vision, which I don't need for coding
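One quick way to confirm that `--n-gpu-layers 999` actually put the whole model on the GPU is to check the server startup log for its "offloaded X/Y layers to GPU" line. A small parser sketch; the exact log wording varies between llama.cpp versions, so treat the regex as an assumption to adjust against your own log:

```python
import re

# Matches lines like "load_tensors: offloaded 35/36 layers to GPU"
# (wording assumed from common llama.cpp builds; adjust if yours differs).
OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")

def fully_offloaded(log_text: str) -> bool:
    """Return True only if every model layer was placed on the GPU."""
    m = OFFLOAD_RE.search(log_text)
    if not m:
        return False  # line not found: can't confirm offload
    done, total = map(int, m.groups())
    return done == total

sample = "load_tensors: offloaded 35/36 layers to GPU"
print(fully_offloaded(sample))  # one layer left on CPU -> False
```

Even one layer (often the output layer) left on the CPU can drag decode speed down hard, which would match the one-core-at-100% symptom described later in the thread.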
My opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"ollama": {
"npm": "@ai-sdk/openai-compatible",
"name": "llama-server (local)",
"options": {
"baseURL": "http://127.0.0.1:8089/v1",
"headers": {
"Authorization": "Bearer any-key"
}
},
"models": {
"gemma4": {
"name": "Gemma 4 E4B",
"limit": {
"context": 16384,
"output": 4096
},
"extraBody": {
"think": true,
// "reasoning_effort": "none",
"stop": ["<turn|>", "<end_of_turn>", "<eos>"]
}
},
"gemma4-fast": {
"name": "Gemma 4 E4B (Fast)",
"limit": {
"context": 16384,
"output": 4096
},
"extraBody": {
"think": true,
"stop": ["<turn|>", "<end_of_turn>", "<eos>"]
}
}
}
}
},
"model": "ollama/gemma4-fast"
}
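To take opencode out of the equation entirely, it helps to time the stream coming straight from llama-server's OpenAI-compatible endpoint (the baseURL in the config above). The measurement half is plain Python and can be sketched like this; how you collect the per-token timestamps (e.g. the `openai` client pointed at `http://127.0.0.1:8089/v1`, appending `time.monotonic()` per streamed chunk) is up to you:

```python
def decode_tps(timestamps: list[float]) -> float:
    """Tokens/second over a stream of per-token arrival times.

    The first timestamp marks the end of prefill, so only the
    gaps between tokens count toward decode speed."""
    if len(timestamps) < 2:
        return 0.0
    elapsed = timestamps[-1] - timestamps[0]
    return (len(timestamps) - 1) / elapsed

# Example: 6 token arrivals spaced 0.2 s apart -> 5 decode steps in 1 s.
print(decode_tps([0.0, 0.2, 0.4, 0.6, 0.8, 1.0]))  # 5.0
```

If a raw curl/client stream measures the same ~5 t/s, the problem is in the server setup, not in opencode.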
No_Fee_2726@reddit
That hardware is way more than enough for the E4B model, so you should be getting insane speeds. Are you sure you have the right drivers and are actually running it on the GPU? Sometimes with new laptops it defaults to the integrated graphics, or the system power settings throttle the card to keep the fan noise down. Check your task manager while you prompt it and see if the GPU usage is actually spiking. Also make sure you are using a quant that is optimized for your specific architecture, because if it is hitting system RAM you are going to see a massive performance cliff.
Plastic-Parsley3094@reddit (OP)
Yeah, my thoughts exactly, but it's not running on the integrated card, as I have disabled it in the BIOS. And in the screenshot in a previous answer you can see it's using the GPU; I have 8 CPU cores and 12 other cores and it's using only one of them at 100%, and the GPU around 5 GB I think. I don't know what kind of model is optimized for my GPU. Also, maybe those GGUF models are made to run on CPU, but I can't find others that run only on GPU.
Jester14@reddit
CUDA 13.2 has known bugs.
sersoniko@reddit
The driver doesn't really matter; what's important is which CUDA toolkit version was used to compile llama.cpp, and they use an old version, CUDA 12.
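If you build llama.cpp yourself and want to see which toolkit your local build would use, the release number can be pulled out of `nvcc --version`. A parsing sketch, with the sample line assumed from typical CUDA installs:

```python
import re

# nvcc --version ends with a line like:
#   "Cuda compilation tools, release 12.4, V12.4.131"
VERSION_RE = re.compile(r"release (\d+)\.(\d+)")

def parse_cuda_release(nvcc_output: str) -> tuple[int, int]:
    """Extract (major, minor) from `nvcc --version` output."""
    m = VERSION_RE.search(nvcc_output)
    if not m:
        raise ValueError("no CUDA release found in nvcc output")
    return int(m.group(1)), int(m.group(2))

sample = "Cuda compilation tools, release 12.4, V12.4.131"
print(parse_cuda_release(sample))  # (12, 4)
```

The driver only needs to be new enough for whatever toolkit the binary was built against; mixing a newer driver with an older toolkit is fine.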
Plastic-Parsley3094@reddit (OP)
Thank you.. so if llama.cpp was compiled with CUDA 12, should I downgrade to 12? I have this installed:
sersoniko@reddit
No no, you actually don’t care what they used to compile it, I was just pointing out the driver you have isn’t the issue
Plastic-Parsley3094@reddit (OP)
Ah, I understand... sorry, I'm kind of a noob at running things locally, and I'm writing here because I have asked Gemini Pro, Microsoft Copilot, Leo AI, ChatGPT, and I still can't get more than a few tokens/s; they suggest flags and commands that don't exist, and so on. I have spent more time trying everything the AIs suggested than actually getting work done. That is why my last option is to write here and hope I can get some help running this on this hardware in a way I can use as a daily driver for Python and Vite projects.
Plastic-Parsley3094@reddit (OP)
Thank you. Should I run CUDA 12, 13.1, or 13.2.1?
terablast@reddit
Did you check that it's actually running on your GPU?
Like, if you run nvidia-smi while the model is being run, are your Memory-Usage and GPU-Util columns high?
Plastic-Parsley3094@reddit (OP)
Here is a screenshot running that model: on the left side without opencode, in the middle are the logs, and on the right are nvidia-smi (top) and nvtop (bottom), which show it uses 5328 MB of GPU memory.
Has anyone successfully run this on a similar GPU?
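One way to watch GPU load continuously while generating is to poll nvidia-smi in query mode. The parsing half can be tested anywhere; only the subprocess call needs a real NVIDIA driver:

```python
import subprocess

# nvidia-smi query mode prints one CSV line per GPU, e.g. "97, 5328"
QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]

def parse_gpu_sample(csv_line: str) -> tuple[int, int]:
    """Parse one 'util %, memory MiB' line from nvidia-smi query output."""
    util, mem = (field.strip() for field in csv_line.split(","))
    return int(util), int(mem)

def sample_gpu() -> tuple[int, int]:
    """Take a live sample of the first GPU (requires an NVIDIA driver)."""
    out = subprocess.check_output(QUERY, text=True)
    return parse_gpu_sample(out.splitlines()[0])

print(parse_gpu_sample("97, 5328"))  # (97, 5328)
```

During decode, a healthy fully-offloaded run should show GPU-Util consistently high; low utilization with one CPU core pegged suggests the generation loop is stuck on the CPU side.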
FinBenton@reddit
Probably running on the integrated GPU or the CPU; check the command to specify the GPU you want in the launch flags.