I accidentally built a universal streaming engine that runs 40GB models on 3GB VRAM
Posted by madtune22@reddit | LocalLLaMA | 22 comments
While trying to run a LoRA on a 12GB GPU without OOMing, I discovered that cpu_offload + async prefetch hooks create a universal streaming engine for any transformer model.
The key insight: transformer blocks execute sequentially. You only need ONE block in VRAM at a time. While GPU computes block N, we DMA-transfer block N+1 from CPU RAM over PCIe. The GPU never waits.
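The schedule described above can be sketched in plain Python. This is a CPU-only toy with hypothetical helper names (`stream_blocks`, `load`, `compute` are illustrative, not the repo's API); a real implementation would issue the copies on a separate CUDA stream into pinned memory, but the double-buffering logic is the same: a background thread stages block N+1 while the main thread runs block N.

```python
import threading
from queue import Queue

def stream_blocks(blocks, load, compute, x):
    """Apply blocks in order, prefetching block N+1 while block N runs.

    `load` stands in for the CPU->GPU copy; `compute` for the forward pass.
    Queue(maxsize=1) keeps at most one block staged ahead of compute,
    mirroring the one-block-in-VRAM budget described above.
    """
    ready = Queue(maxsize=1)

    def prefetch():
        for b in blocks:
            ready.put(load(b))  # blocks while a staged item is waiting

    t = threading.Thread(target=prefetch, daemon=True)
    t.start()
    for _ in blocks:
        x = compute(ready.get(), x)  # overlaps with the next load()
    t.join()
    return x

# toy usage: "blocks" are scale factors, "compute" multiplies
out = stream_blocks([2, 3, 4], load=lambda b: b,
                    compute=lambda b, x: x * b, x=1)
print(out)  # 24
```

If `compute` takes longer than `load`, the queue is always full when the main thread asks for the next block, which is the "GPU never waits" condition.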
Results on RTX 3060 12GB:
- Z-Image-Turbo: needs 24GB → runs at 1.4GB VRAM
- Wan2.2 I2V 14B: needs 80GB → runs at 2-4GB VRAM
- Qwen-Image: needs 40GB → runs at 3GB VRAM (batch of 10 @ 1080p = 8GB)
No quantization. Full bfloat16. 130 lines of Python.
GitHub: https://github.com/madtunebk/streamforge
Klutzy-Snow8016@reddit
I think ComfyUI does some fancy memory management under the hood to solve the same problem. Have you compared the speed of your implementation to what you get running the same models in Comfy?
FullstackSensei@reddit
You just rediscovered mmap, like dozens of vibe coders who don't read about the 40+ year old mechanism that exists in all operating systems: mmap
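For reference, the mechanism the comment is invoking: `mmap` maps a file into the process address space and the OS faults pages in only when they are touched, which is how loaders like llama.cpp read huge weight files without holding them fully in RAM. A minimal sketch using Python's standard `mmap` module:

```python
import mmap
import os
import tempfile

# Map a file and read one slice: only the touched pages are faulted in,
# so a huge weight file never needs to fit in memory up front.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\x00" * 4096 + b"weights" + b"\x00" * 4096)

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0,
                                      access=mmap.ACCESS_READ) as mm:
    chunk = mm[4096:4103]  # demand-paged read of one region

os.remove(path)
print(chunk)  # b'weights'
```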
CalligrapherFar7833@reddit
mmap is about 43 years old, though the modern stable POSIX mmap is only 30-31 years old :D
FullstackSensei@reddit
So, if the year was 2036, then OP wouldn't have an excuse for this slop? I hope LLMs will tell users not to slop such slop by then 😅
Uncle___Marty@reddit
https://github.com/deepbeepmeep/mmgp
Deepbeepmeep has been spreading the lowvram love for a while and it works great. Gotta love these rediscovery posts though lol.
opi098514@reddit
Did AI tell you this is something new…… because this isn't. It's been around for decades.
madtune22@reddit (OP)
You're right, the technique isn't new. The value is 130 lines that plug into any diffusers model in one function call. No ComfyUI, no setup, just Python.
PhoneOk7721@reddit
That's an AI response.
Certain-Cod-1404@reddit
Fairly certain projects like these already exist and aren't really used because of how slow it is to run inference this way. Do you have data on the speed difference?
madtune22@reddit (OP)
Similar projects exist but use quantization or disk streaming. StreamForge streams full bfloat16 weights from CPU RAM over PCIe — no quality loss.
Speed data from our tests on RTX 3060 12GB:
- Qwen-Image 40GB model: ~18s/it at 768p, ~29s/it at 1080p (20 steps = ~6-10 min)
- Wan2.2 I2V 14B: ~45s/it per frame
- Z-Image 11GB: ~same as normal CPU offload
About 30-40% slower than native GPU inference. The tradeoff: you can actually run the model at all. A 4090 with 24GB VRAM can't even load Qwen-Image normally. We ran batch=10 @ 1080p using 8GB VRAM on a single RTX 3060.
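A back-of-envelope check on those numbers: every denoising step has to move the full weight set host-to-device, which puts a hard floor on step time. Assuming roughly 25 GB/s of effective PCIe 4.0 x16 bandwidth (an assumed figure; actual throughput varies by platform and pinned-memory use):

```python
# Lower bound on per-step time from weight transfer alone.
# 25 GB/s is an assumed effective PCIe 4.0 x16 rate, not a measurement.
weights_gb = 40.0   # Qwen-Image in bf16, per the post
pcie_gbps = 25.0    # assumed effective host->device bandwidth
floor_s = weights_gb / pcie_gbps
print(f"{floor_s:.1f} s/it minimum just for transfers")
```

At the ~18 s/it reported above, a ~1.6 s transfer floor is easily hidden behind compute, which is consistent with the claim that the prefetch keeps the GPU from waiting.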
Certain-Cod-1404@reddit
Isn't there a project called air llm or something that also runs models in full precision one layer at a time? Is the 30-40% figure empirical or are you guesstimating?
siete82@reddit
In the README he claims this for the current implementation:
siete82@reddit
Sorry if I'm talking nonsense, but doesn't ComfyUI already do this?
tavirabon@reddit
I think natively now. Used to need block swap nodes that only worked on models it could properly detect.
Finanzamt_Endgegner@reddit
Well, back in the day DisTorch and Kijai's nodes did this. Idk if it's in ComfyUI itself now, but it's definitely possible to use.
GTManiK@reddit
How does it compare to RamTorch?
https://github.com/lodestone-rock/RamTorch
tavirabon@reddit
Obviously poorly, one is actually engineered for performance.
matt-k-wong@reddit
Is this similar to or different from flash streaming?
Skystunt@reddit
How’s the speed? This looks cool !
madtune22@reddit (OP)
To be honest, it's a lot slower than native full GPU load — but it runs at full precision with no quantization. The limit is now your CPU RAM size, not VRAM.
TheDailySpank@reddit
Correct me if I'm wrong, but this means if you have 8GB VRAM, you can run any model that fits in CPU RAM, at full speed, with no quantization?
alok_saurabh@reddit
How is the generation speed ?