Tried running UI-TARS 7B on Colab free T4 — OOM'd
Posted by Long_Respond1735@reddit | LocalLLaMA | 1 comment
Spent 30 minutes today trying to serve UI-TARS 1.5 7B via vLLM on Colab's free T4. OOM. The model weights alone are 14.2GB in FP16, and vLLM adds ~2GB overhead — T4 only has 15.6GB.
Switched to Ollama with a Q4 quant on Kaggle's free T4x2 and it worked fine. But I only figured this out after trial and error.
I know there are web-based VRAM calculators (apxml, gpuforllm, etc.), but they don't account for:
- Runtime overhead (vLLM vs Ollama vs llama.cpp — big difference)
- Vision model encoder overhead (VLMs need extra VRAM for the vision encoder on top of the language model)
- Auto-detecting your actual GPU
Is there a CLI tool that does something like:
check ui-tars-7b --gpu t4 --runtime vllm
→ ❌ won't fit (17.1GB needed, 15.6GB available)
→ try Q4 via Ollama instead (4.5GB)
Or does everyone just trial-and-error it?
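For what it's worth, the core check is just lookup-table math. Here's a rough sketch of what such a CLI could do under the hood — the overhead and bytes-per-param numbers below are assumptions pulled from this thread, not benchmarks, and `fits` is a hypothetical helper, not a real tool:

```python
# Rough VRAM-fit estimate. All constants are assumptions, not measured values.
RUNTIME_OVERHEAD_GB = {"vllm": 2.0, "ollama": 0.5, "llama.cpp": 0.3}
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.57}  # q4 ≈ Q4_K_M average

def fits(params_b, precision, runtime, gpu_vram_gb, vision_encoder_gb=0.0):
    """Return (fits, needed_gb): crude VRAM estimate vs. available VRAM."""
    weights_gb = params_b * BYTES_PER_PARAM[precision]
    needed = weights_gb + RUNTIME_OVERHEAD_GB[runtime] + vision_encoder_gb
    return needed <= gpu_vram_gb, round(needed, 1)

# 7B FP16 via vLLM on a 15.6GB T4, +1GB assumed for the vision encoder
print(fits(7, "fp16", "vllm", 15.6, vision_encoder_gb=1.0))   # won't fit
# Same model as a Q4 quant via Ollama
print(fits(7, "q4", "ollama", 15.6, vision_encoder_gb=1.0))   # fits easily
```

It ignores KV cache (which scales with context length and batch size), so real numbers will be higher — but even this crude version would have flagged my OOM before I burned 30 minutes on it.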
Status_Record_1839@reddit
Ollama with Q4 is the right call here. vLLM adds ~2GB overhead on top of weights, so 7B FP16 is always going to OOM on T4. For VLMs specifically the vision encoder eats another 1-2GB that most calculators ignore.