Gemma 4 E2B or E4B on RTX 5070 Ti laptop (12 GB) not running on vLLM

Posted by Plastic-Parsley3094@reddit | LocalLLaMA | View on Reddit | 4 comments

I can't get Gemma 4 E2B or E4B to run on my laptop. I am running it via Docker as per the vLLM website, and I get this error:
Free memory on device cuda:0 (9.71/11.5 GiB) on startup is less than desired GPU memory utilization (0.9, 10.35 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
So I guess I don't have enough memory. But I have seen people run Gemma, even 26B, on 12 GB VRAM without any issues and at good speeds, so I have no idea what I am doing wrong. Please help.
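If I redo the arithmetic from the error message, the numbers are at least consistent: the startup check wants 0.9 of total VRAM, but only 9.71 GiB is free, so roughly 1.8 GiB on the card is already taken by something else before vLLM even starts. Quick sanity check, numbers copied straight from the log:

```python
# Reproduce the startup memory check from the vLLM error message.
# Values come from the log; 0.90 is the --gpu-memory-utilization flag.
total_gib = 11.5    # total VRAM reported for cuda:0
free_gib = 9.71     # free VRAM at startup
gpu_memory_utilization = 0.90

desired_gib = total_gib * gpu_memory_utilization
already_used_gib = total_gib - free_gib

print(f"desired: {desired_gib:.2f} GiB, free: {free_gib:.2f} GiB")
print(f"already in use before vLLM starts: {already_used_gib:.2f} GiB")
print("startup check passes:", free_gib >= desired_gib)  # False
```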

Running a quantized model like prithivMLmods/gemma-4-E2B-it-FP8, it gets stuck at:

vllm-1  | (EngineCore pid=157) INFO 04-16 09:33:43 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
vllm-1  | (EngineCore pid=157) INFO 04-16 09:33:43 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.

Hardware: Lenovo Legion Pro i5

CPU: Intel(R) Core(TM) Ultra 9 275HX (24) @ 5.40 GHz
GPU 1: NVIDIA GeForce RTX 5070 Ti Mobile 12GB VRAM [Discrete]
GPU 2: Intel Graphics [Integrated]
Memory: 32 GB
OS: Arch Linux (CachyOS)
I have tried vLLM in Docker, as I can't get it to work in a pip env on my laptop.

docker-compose.yml

version: "3.8"
services:
  vllm:
    # build: .
    image: vllm/vllm-openai:gemma4-cu130
    ports:
      - "8000:8000"
    volumes:
      - model-cache:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model google/gemma-4-E2B-it
      --host 0.0.0.0
      --port 8000
      --max-model-len 8192
      --gpu-memory-utilization 0.90
      --dtype bfloat16
      --trust-remote-code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
volumes:
  model-cache:
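For reference, these are the knobs in the command section that I understand trade VRAM for context length and speed. The exact values here are guesses on my part, not something I've confirmed works on this card:

```yaml
    command: >
      --model google/gemma-4-E2B-it
      --host 0.0.0.0
      --port 8000
      --max-model-len 4096          # shorter context -> smaller KV cache
      --gpu-memory-utilization 0.80 # target below the 9.71 GiB actually free
      --kv-cache-dtype fp8          # halves KV cache size vs bf16
      --enforce-eager               # skip CUDA graph capture to save some VRAM
      --dtype bfloat16
      --trust-remote-code
```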

Logs from docker compose for the vllm service:

ValueError: Free memory on device cuda:0 (9.71/11.5 GiB) on startup is less than desired GPU memory utilization (0.9, 10.35 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
vllm-1  | [rank0]:[W416 09:04:45.775515380 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

I have even decreased --gpu-memory-utilization, and then I get this error:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a total capacity of 11.50 GiB of which 75.44 MiB is free. Including non-PyTorch memory, this process has 9.79 GiB memory in use. Of the allocated memory 9.47 GiB is allocated by PyTorch, and 68.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
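The OOM message itself suggests setting PYTORCH_ALLOC_CONF=expandable_segments:True. If anyone wants to reproduce, I believe it can be passed through the compose environment block like this (I haven't confirmed it fixes anything on my side):

```yaml
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
      # suggested by the OOM message above, to reduce allocator fragmentation
      - PYTORCH_ALLOC_CONF=expandable_segments:True
```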