Gemma 4 E2B or E4B on RTX 5070 Ti laptop (12 GB) not running on vLLM
Posted by Plastic-Parsley3094@reddit | LocalLLaMA | 4 comments
I can't get Gemma 4 E2B or Gemma 4 E4B to run on my laptop. I am running it via Docker as per the vLLM website, and I get the error:
Free memory on device cuda:0 (9.71/11.5 GiB) on startup is less than desired GPU memory utilization (0.9, 10.35 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
So I guess I don't have enough memory. But I have seen people run Gemma, even 26B, on 12 GB VRAM without any issues and at good speeds, so I have no idea what I am doing wrong. Please help.
Running a quantized model like prithivMLmods/gemma-4-E2B-it-FP8, it gets stuck at:
vllm-1 | (EngineCore pid=157) INFO 04-16 09:33:43 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
Hardware: Lenovo Legion Pro 5i
CPU: Intel(R) Core(TM) Ultra 9 275HX (24) @ 5.40 GHz
GPU 1: NVIDIA GeForce RTX 5070 Ti Mobile, 12 GB VRAM [Discrete]
GPU 2: Intel Graphics [Integrated]
Memory: 32 GB
OS: Arch Linux (CachyOS)
I have tried vLLM in Docker, as I can't get it to work in a pip environment on my laptop.
docker-compose.yml
version: "3.8"
services:
  vllm:
    # build: .
    image: vllm/vllm-openai:gemma4-cu130
    ports:
      - "8000:8000"
    volumes:
      - model-cache:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model google/gemma-4-E2B-it
      --host 0.0.0.0
      --port 8000
      --max-model-len 8192
      --gpu-memory-utilization 0.90
      --dtype bfloat16
      --trust-remote-code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  model-cache:
Logs from docker compose for the vllm service:
ValueError: Free memory on device cuda:0 (9.71/11.5 GiB) on startup is less than desired GPU memory utilization (0.9, 10.35 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
vllm-1 | [rank0]:[W416 09:04:45.775515380 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
I have even decreased --gpu-memory-utilization, and then I get the error:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a total capacity of 11.50 GiB of which 75.44 MiB is free. Including non-PyTorch memory, this process has 9.79 GiB memory in use. Of the allocated memory 9.47 GiB is allocated by PyTorch, and 68.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Content_Anything_908@reddit
There are several problems combined here. Let me walk through them one by one:
🔴 Problem 1: Your display is eating your VRAM
Your RTX 5070 Ti has 11.5 GiB, but only 9.71 GiB show as free on startup, which means ~1.8 GiB are already taken by the graphics server (Wayland/X11 running on the dGPU). On CachyOS with a Lenovo Legion, it is very common for the display to run on the discrete NVIDIA GPU.
Solution: Check with nvidia-smi and force the display onto the Intel iGPU.
Or, easier: run vLLM from a TTY (Ctrl+Alt+F2) with no graphical session active on the dGPU. That frees up those ~1.8 GiB.
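To see what is actually holding VRAM before starting the container, a quick check with standard nvidia-smi query flags looks like this (note that graphics clients such as Xorg only appear in the plain nvidia-smi process table, not in the compute-apps query):

```shell
# Overall memory picture on the dGPU.
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv

# Compute processes holding VRAM (vLLM, PyTorch, etc.).
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# Full view, including graphics processes like Xorg / gnome-shell.
nvidia-smi
```

Anything listed here is memory vLLM cannot use; kill those processes or move the display to the iGPU before launching the container.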
🔴 Problem 2: The RTX 5070 Ti is Blackwell (very new)
This GPU uses the Blackwell architecture (sm_120) and has known problems with CUDA graphs in vLLM, which is why it hangs at TRITON_ATTN. You need to add --enforce-eager to disable CUDA graphs.
🔴 Problem 3: max-model-len 8192 reserves too much KV cache
The KV cache for 8192 tokens with Gemma 4 E2B in bfloat16 can require an extra ~2-3 GiB. Lower it to 4096 first to test, and raise it later if you have memory to spare.
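The KV-cache cost scales linearly with context length, which is why halving --max-model-len helps. A back-of-envelope sketch of the arithmetic, using placeholder layer/head numbers (not the real Gemma E2B config; substitute the values from the model's config.json):

```python
# Per-token KV cache cost for a transformer:
#   2 (K and V) * num_layers * num_kv_heads * head_dim * bytes_per_element
# The defaults below are illustrative placeholders, NOT Gemma E2B's real dims.

def kv_cache_bytes(seq_len: int,
                   num_layers: int = 30,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_element: int = 2) -> int:  # 2 bytes = bfloat16
    """Bytes of KV cache needed for one sequence of seq_len tokens."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * seq_len

gib = 1024 ** 3
for seq_len in (4096, 8192):
    # With these placeholder dims: ~0.47 GiB at 4096, ~0.94 GiB at 8192.
    print(f"{seq_len} tokens -> {kv_cache_bytes(seq_len) / gib:.2f} GiB")
```

vLLM pre-allocates this cache (scaled by --max-num-seqs and the memory-utilization fraction), so dropping the context length frees real headroom at startup.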
✅ Corrected docker-compose.yml
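The corrected compose file was not preserved in the original post; a minimal sketch applying the fixes described above (--enforce-eager, max-model-len 4096, the allocator env var, and shm_size), keeping the OP's image tag and model name:

```yaml
services:
  vllm:
    image: vllm/vllm-openai:gemma4-cu130
    shm_size: "2gb"
    ports:
      - "8000:8000"
    volumes:
      - model-cache:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
      - PYTORCH_ALLOC_CONF=expandable_segments:True
    command: >
      --model google/gemma-4-E2B-it
      --host 0.0.0.0
      --port 8000
      --max-model-len 4096
      --gpu-memory-utilization 0.85
      --dtype bfloat16
      --enforce-eager
      --trust-remote-code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  model-cache:
```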
📋 Step checklist
- Run nvidia-smi before bringing Docker up; if you see processes using VRAM, free them or move the display to the Intel GPU
- Add --enforce-eager (critical for Blackwell)
- Lower --max-model-len to 4096
- Set PYTORCH_ALLOC_CONF=expandable_segments:True
- Set shm_size: '2gb'
The --enforce-eager flag is almost certainly the cause of the hang; it is the most widely reported fix for Blackwell GPUs on vLLM right now.
Plastic-Parsley3094@reddit (OP)
Update: I have tried with this docker-compose.yml with the suggestions you gave me, but no, it's the same. I don't understand. I have tried in a TTY too.
Plastic-Parsley3094@reddit (OP)
Thank you very much for this clarification. I will write an update after trying those steps. Have you tried running models on personal GPUs like mine, or an RTX 4070? What is your experience with those GPUs? I really want to get this working and educate myself on this. Sorry for my Spanish, it has been years since I wrote in Spanish 😁
reto-wyss@reddit
Try a low --max-num-seqs value, like 16, in the command.
Currently the "best" way to get it as compatible as possible with the latest models is to install vLLM (nightly) first and then uv pip install -U transformers
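Roughly, that install order would look like the following; the nightly wheel index URL is my assumption, so verify it against the current vLLM installation docs:

```shell
# Install a vLLM nightly build first (wheel index URL is an assumption,
# check the vLLM install docs), then upgrade transformers on top of it so
# the newest model architectures are recognized.
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
uv pip install -U transformers
```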