Gemma 4 E2B or E4B on RTX 5070 Ti laptop (12 GB) not running on vLLM
Posted by Plastic-Parsley3094@reddit | LocalLLaMA | 4 comments
I can't get Gemma 4 E2B or Gemma 4 E4B to run on my laptop. I am running it via Docker as per the vLLM website, and I get the error:
Free memory on device cuda:0 (9.71/11.5 GiB) on startup is less than desired GPU memory utilization (0.9, 10.35 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
So I guess I don't have enough memory. But I have seen people run Gemma, even 26B, on 12 GB VRAM without any issues and at good speeds, so I have no idea what I am doing wrong. Please help.
Running a quantized model like prithivMLmods/gemma-4-E2B-it-FP8, it gets stuck at:
vllm-1 | (EngineCore pid=157) INFO 04-16 09:33:43 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
Hardware: Lenovo Legion Pro 5i
CPU: Intel(R) Core(TM) Ultra 9 275HX (24) @ 5.40 GHz
GPU 1: NVIDIA GeForce RTX 5070 Ti Mobile, 12 GB VRAM [Discrete]
GPU 2: Intel Graphics [Integrated]
Memory: 32 GB
OS: Arch Linux (CachyOS)
I have tried vLLM in Docker, as I can't get it to work in a pip environment on my laptop.
docker-compose.yml
version: "3.8"
services:
  vllm:
    # build: .
    image: vllm/vllm-openai:gemma4-cu130
    ports:
      - "8000:8000"
    volumes:
      - model-cache:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model google/gemma-4-E2B-it
      --host 0.0.0.0
      --port 8000
      --max-model-len 8192
      --gpu-memory-utilization 0.90
      --dtype bfloat16
      --trust-remote-code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  model-cache:
Logs from docker compose for the vllm service:
ValueError: Free memory on device cuda:0 (9.71/11.5 GiB) on startup is less than desired GPU memory utilization (0.9, 10.35 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
vllm-1 | [rank0]:[W416 09:04:45.775515380 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
I have even decreased --gpu-memory-utilization, and then I get the error:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a total capacity of 11.50 GiB of which 75.44 MiB is free. Including non-PyTorch memory, this process has 9.79 GiB memory in use. Of the allocated memory 9.47 GiB is allocated by PyTorch, and 68.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Content_Anything_908@reddit
There are several problems combined here. Let me walk through them one by one:
🔴 Problem 1: Your display is eating your VRAM
Your RTX 5070 Ti has 11.5 GiB, but only 9.71 GiB show as free on startup, which means ~1.8 GiB are already taken by the graphics server (Wayland/X11 running on the dGPU). On CachyOS with a Lenovo Legion, it is very common for the display to run on the discrete NVIDIA GPU.
Solution: Check with nvidia-smi and force the display onto the Intel iGPU.
Or, easier: run vLLM from a TTY (Ctrl+Alt+F2) with no graphical session active on the dGPU. That frees up those ~1.8 GiB.
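To see what is actually holding VRAM before starting the container, a quick check with standard nvidia-smi query flags looks like this (note that graphics clients such as Xorg only appear in the plain nvidia-smi process table, not in the compute-apps query):

```shell
# Overall memory picture on the dGPU.
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv

# Compute processes holding VRAM (vLLM, PyTorch, etc.).
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

# Full view, including graphics processes like Xorg / gnome-shell.
nvidia-smi
```

Anything listed here is memory vLLM cannot use; kill those processes or move the display to the iGPU before launching the container.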
🔴 Problem 2: The RTX 5070 Ti is Blackwell (very new)
This GPU uses the Blackwell architecture (sm_120) and has known problems with CUDA graphs in vLLM, which is why it hangs at TRITON_ATTN. You need to add --enforce-eager to disable CUDA graphs.
🔴 Problem 3: max-model-len 8192 reserves too much KV cache
The KV cache for 8192 tokens with Gemma 4 E2B in bfloat16 can require an extra ~2-3 GiB. Lower it to 4096 first to test, and raise it later if you have memory to spare.
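The KV-cache cost scales linearly with context length, which is why halving --max-model-len helps. A back-of-envelope sketch of the arithmetic, using placeholder layer/head numbers (not the real Gemma E2B config; substitute the values from the model's config.json):

```python
# Per-token KV cache cost for a transformer:
#   2 (K and V) * num_layers * num_kv_heads * head_dim * bytes_per_element
# The defaults below are illustrative placeholders, NOT Gemma E2B's real dims.

def kv_cache_bytes(seq_len: int,
                   num_layers: int = 30,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_element: int = 2) -> int:  # 2 bytes = bfloat16
    """Bytes of KV cache needed for one sequence of seq_len tokens."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * seq_len

gib = 1024 ** 3
for seq_len in (4096, 8192):
    # With these placeholder dims: ~0.47 GiB at 4096, ~0.94 GiB at 8192.
    print(f"{seq_len} tokens -> {kv_cache_bytes(seq_len) / gib:.2f} GiB")
```

vLLM pre-allocates this cache (scaled by --max-num-seqs and the memory-utilization fraction), so dropping the context length frees real headroom at startup.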
✅ Corrected docker-compose.yml
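The corrected compose file was not preserved in the original post; a minimal sketch applying the fixes described above (--enforce-eager, max-model-len 4096, the allocator env var, and shm_size), keeping the OP's image tag and model name:

```yaml
services:
  vllm:
    image: vllm/vllm-openai:gemma4-cu130
    shm_size: "2gb"
    ports:
      - "8000:8000"
    volumes:
      - model-cache:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
      - PYTORCH_ALLOC_CONF=expandable_segments:True
    command: >
      --model google/gemma-4-E2B-it
      --host 0.0.0.0
      --port 8000
      --max-model-len 4096
      --gpu-memory-utilization 0.85
      --dtype bfloat16
      --enforce-eager
      --trust-remote-code
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  model-cache:
```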
📋 Step checklist
- Run nvidia-smi before bringing Docker up; if you see processes using VRAM, free them or move the display to the Intel GPU
- Add --enforce-eager (critical for Blackwell)
- Lower --max-model-len to 4096
- Set PYTORCH_ALLOC_CONF=expandable_segments:True
- Set shm_size: '2gb'
The --enforce-eager flag is almost certainly the cause of the hang; it is the most widely reported fix for Blackwell GPUs on vLLM right now.
Plastic-Parsley3094@reddit (OP)
Update: I have tried with this docker-compose.yml with the suggestions you gave me, but no, it's the same. I don't understand. I have tried in a TTY too.
Plastic-Parsley3094@reddit (OP)
Thank you very much for this clarification. I will write an update after trying those steps. Have you tried running models on personal GPUs like mine, or an RTX 4070? What is your experience with those GPUs? I really want to get this working and educate myself on this. Sorry for my Spanish, it has been years since I wrote in Spanish 😁
reto-wyss@reddit
Try a low --max-num-seqs value, like 16, in the command.
Currently the "best" way to get it as compatible as possible with the latest models is to install vLLM (nightly) first and then uv pip install -U transformers
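Roughly, that install order would look like the following; the nightly wheel index URL is my assumption, so verify it against the current vLLM installation docs:

```shell
# Install a vLLM nightly build first (wheel index URL is an assumption,
# check the vLLM install docs), then upgrade transformers on top of it so
# the newest model architectures are recognized.
uv pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
uv pip install -U transformers
```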