I accidentally built a universal streaming engine that runs 40GB models on 3GB VRAM

Posted by madtune22@reddit | LocalLLaMA | 22 comments

While trying to run a LoRA on a 12GB GPU without OOMing, I discovered that cpu_offload + async prefetch hooks create a universal streaming engine for any transformer model.

The key insight: transformer blocks execute sequentially. You only need ONE block in VRAM at a time. While GPU computes block N, we DMA-transfer block N+1 from CPU RAM over PCIe. The GPU never waits.
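The overlap above can be sketched in pure Python, with a background thread standing in for the async DMA copy on a side CUDA stream. Everything here (`Block`, `stage`, `stream_forward`) is an illustrative name for the pattern, not the repo's actual API:

```python
# CPU-only sketch of double-buffered block streaming: while block N
# computes, a worker thread stages block N+1 into the device slot.
import threading

class Block:
    def __init__(self, idx):
        self.idx = idx
        self.on_device = False   # stands in for "weights resident in VRAM"

    def stage(self):             # stands in for an async host->device copy
        self.on_device = True

    def run(self, x):
        assert self.on_device, "block must be resident before compute"
        return x + self.idx      # placeholder for the real forward pass

def stream_forward(blocks, x):
    blocks[0].stage()            # first block is loaded up front
    for n, block in enumerate(blocks):
        prefetch = None
        if n + 1 < len(blocks):  # overlap: fetch N+1 while computing N
            prefetch = threading.Thread(target=blocks[n + 1].stage)
            prefetch.start()
        x = block.run(x)
        block.on_device = False  # evict N: roughly one block resident at a time
        if prefetch:
            prefetch.join()      # on a real GPU this is a stream/event sync
    return x

result = stream_forward([Block(i) for i in range(5)], 0)
```

In the real thing the `stage`/`join` pair would be a `tensor.to("cuda", non_blocking=True)` on a separate stream plus an event wait in a forward pre-hook, so the PCIe transfer of block N+1 hides entirely behind the compute of block N.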

Results on RTX 3060 12GB:

- Z-Image-Turbo: needs 24GB → runs at 1.4GB VRAM
- Wan2.2 I2V 14B: needs 80GB → runs at 2-4GB VRAM
- Qwen-Image: needs 40GB → runs at 3GB VRAM (batch of 10 @ 1080p = 8GB)

No quantization. Full bfloat16. 130 lines of Python.

GitHub: https://github.com/madtunebk/streamforge