Built LazyMoE — run 120B LLMs on 8GB RAM with no GPU using lazy expert loading + TurboQuant

kymigreg@reddit

I'm going to need token speed estimates

[-]

I can try to guess. If the experts are in Q4, DeepSeek V3 would load roughly 10 GB per token. Assuming you have a PCIe 4.0 SSD with 6 GB/s sequential read speeds, you'd have roughly 2 seconds per token. Actually, probably slower if you have DDR4 RAM, since most of your RAM bandwidth is eaten up by reading from the SSD, on top of the reads necessary to actually compute the forward pass. I'd expect more like 5 seconds per token lol
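
A rough sketch of that back-of-the-envelope estimate, using the 10 GB per token and 6 GB/s figures from this comment. The function name is made up for illustration, and the result is only a lower bound since it ignores latency and compute time.

```python
def seconds_per_token(gb_read_per_token: float, ssd_read_gb_per_s: float) -> float:
    """Lower bound on time per token when each token must stream its weights from SSD."""
    return gb_read_per_token / ssd_read_gb_per_s


t = seconds_per_token(10.0, 6.0)
print(f"{t:.2f} s/token, {1 / t:.2f} tok/s")  # 1.67 s/token, 0.60 tok/s
```

Anything that competes for bandwidth (the DDR4 concern above) only pushes the real number up from here.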

Party-Special-5177@reddit

Ooh! I actually did this exact thing early last year for shits and giggles! Well, almost. Running full DeepSeek 671B in swap off of 2 RAIDed Gen 4 NVMe drives netted about 9 seconds a token. Prompt processing of my old-standard “perform the convolution of these 2 functions…” prompt took just under 2 days (it’s about 2 paragraphs, don’t remember the number of tokens).

z_latent@reddit

Jesus, you did this using swap... I gotta pay respects to the TBW that your drives had to use up. Every time it read more parameters in, the memory that was getting overwritten had to be moved into swap, 10 GB written per token. If it output say, 1000 tokens, then something like 1% of your SSDs' total life span is gone. At least it was on a RAID lol
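
The wear estimate can be sketched like this. The 10 GB per token and 1000 tokens come from this comment; the 1000 TBW endurance rating is an assumption picked so the numbers line up with the ~1% figure, so check the real rating of your own drives.

```python
def fraction_of_endurance_used(gb_written_per_token: float, tokens: int, rated_tbw: float) -> float:
    """Share of a drive's rated terabytes-written (TBW) consumed by swap writes."""
    tb_written = gb_written_per_token * tokens / 1000  # decimal units: 1 TB = 1000 GB
    return tb_written / rated_tbw


# ASSUMPTION: 1000 TBW total rated endurance (hypothetical, not from the thread).
frac = fraction_of_endurance_used(10.0, 1000, 1000.0)
print(f"{frac:.1%} of rated endurance")  # 1.0% of rated endurance
```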

ReasonableRefuse4996@reddit (OP)

haha yeah, you were right, that was genuinely bad. Didn't even think about the write cycles until you pointed it out. Just pushed a fix: mlock is now on by default so the OS can't touch swap anymore. Also added a warning on startup if your RAM is too low, so at least people get a heads up before they start grinding their drive into dust. Updated the README too with a proper warning section and a pointer to the system panel, so users can check what's actually safe to run on their hardware before trying anything stupid. Appreciate you doing the math on that, genuinely useful catch.

Party-Special-5177@reddit

Oh it was horrible lol. But at the time I was determined to see running a SOTA model on home hardware with my own eyes, and it was worth damn near any cost. That was the first model I’d ever tried that could perform calculus without endless mistakes. Seriously, never had I ever seen a model that could convolve before.

ReasonableRefuse4996@reddit (OP)

Your math is pretty much right and I won't pretend otherwise. The 120B claim in the title is more about architectural capability than practical speed on 8GB RAM. On my actual machine I'm running Mistral 7B at 2-4 tok/s — that's the honest number. For DeepSeek V3 or anything above 70B on consumer hardware, SSD streaming makes it technically possible but painfully slow exactly as you calculated. The lazy expert loading helps reduce how much you're reading per token compared to loading everything, but it doesn't change the fundamental bandwidth constraint you're describing. The realistic sweet spot for this approach is 7B-14B models on 8GB RAM and 70B models on 32GB+ RAM where you actually get usable speeds. Anything bigger is a proof of concept more than a daily driver. Appreciate the detailed breakdown — this kind of analysis is exactly what the project needs.

Song-Historical@reddit

There's a promising project called hilos that uses a SmartSSD to offload the KV cache to an SSD, but I don't see a reason you couldn't build some sort of FPGA + flash memory stick/PCIe card that does the same. Inference stays on the GPU, but the context is no longer limited by your GPU's memory.

z_latent@reddit

GPT-OSS 120B has 5.1B active params, 3.6B of those are experts. So ~1.8 GB at Q4. The remaining 1.5B dense params are just 750 MB, so they all fit in RAM comfortably. Same setup as before, 6 GB/s sequential read, would have ~0.3 seconds per token, at least in time spent reading.

This is actually ~3 tok/s, so lemme go a bit more in detail in fact, since this is a bit less of a "joke" number. If you have DDR4-3200, so 25.6 GB/s bandwidth, you could get ~10 tok/s (so 0.1 s/tok) on a 5B dense model at Q4 (25.6 / 2.5). When doing this off-loading, we essentially include some reads from the SSD in-between as well. That means the time per token would be roughly 0.3 + 0.1 = 0.4s, which gives you ~2 tok/s, if you assume some extra overhead we're ignoring. [1]

[1]: Ignore the thing I said in the original comment about your bandwidth being eaten up by the reading from SSD. I forgot that, in principle, those two can happen in parallel, aka the RAM is being written to throughout those 0.3 seconds of reading from SSD.
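
The same arithmetic as a small sketch, using only the figures from this comment (3.6B expert params, ~5B active params, 6 GB/s SSD, 25.6 GB/s RAM). The 0.5 bytes per parameter for Q4 is the rule of thumb the comment implies, and the names are made up for illustration.

```python
BYTES_PER_PARAM_Q4 = 0.5  # rule of thumb implied above: 3.6B params -> ~1.8 GB


def read_time_s(params_billions: float, bandwidth_gb_per_s: float) -> float:
    """Seconds to stream the given number of Q4 parameters once at the given bandwidth."""
    return params_billions * BYTES_PER_PARAM_Q4 / bandwidth_gb_per_s


ssd_s = read_time_s(3.6, 6.0)    # expert weights streamed from SSD for each token
ram_s = read_time_s(5.0, 25.6)   # ~5B active params read from DDR4-3200 for each token
total_s = ssd_s + ram_s          # pessimistic: assumes the two reads do not overlap

print(f"SSD {ssd_s:.2f} s + RAM {ram_s:.2f} s = {total_s:.2f} s/token ({1 / total_s:.1f} tok/s)")
```

This lands around 0.4 s per token, so roughly 2 tok/s once some overhead is added on top.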

Hougasej@reddit

Latency would eat half of the speed. Even if you load everything into RAM, you'll get at best 70% of maximum throughput. SSDs respond much more slowly, especially with the random reads of a sparse MoE.

z_latent@reddit

Latency is much bigger even on NVMe, true, but won't be as bad as you may think. I omitted it because it's still too small compared to the actual time spent reading. I arrived at 0.3s to read all experts for one token. Since the model has 36 layers, then we have ~8 ms per layer, plus latency. However, modern SSD latency is around 0.1 ms. Much bigger than DRAM latency, for sure, but only enough to raise our total time per layer to 8.1 ms. It's fine. Though it would matter if we were loading smaller things. For instance, I did similar math on per-layer embedding offloading (for Gemma 4 E4B), and in that case it was overwhelmingly limited by read latency because the PLEs loaded per token are like, 100 000x smaller than experts.
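
A quick sketch of why latency barely matters here, using the numbers from this comment (0.3 s of reading, 36 layers, ~0.1 ms per read). It assumes one blocking read per layer, which is a simplification, and the function name is hypothetical.

```python
def token_time_with_latency(read_time_s: float, layers: int, latency_s: float) -> float:
    """Streaming time plus one read latency per layer (assumes one blocking read per layer)."""
    return read_time_s + layers * latency_s


read_time_s = 0.3    # time to read all experts for one token
layers = 36
latency_s = 0.1e-3   # ~0.1 ms per SSD read

per_layer_ms = read_time_s / layers * 1e3
total_s = token_time_with_latency(read_time_s, layers, latency_s)
overhead_pct = 100 * (total_s / read_time_s - 1)

print(f"{per_layer_ms:.1f} ms per layer before latency")
print(f"{total_s:.4f} s per token, about {overhead_pct:.1f}% slower")
```

With many small reads per layer instead of one large one, `layers * latency_s` grows and can dominate, which is the per-layer-embedding case mentioned above.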

TheRealMasonMac@reddit

Closest I got is w/ Qwen3.5-122B at IQ1 w/ 64gb RAM and 12gb VRAM w/ 600TPS prompt processing and 20 TPS generation. It was fairly coherent. Without a GPU or enough RAM, I'd guess it's probably like 1TPS generation.

ReasonableRefuse4996@reddit (OP)

Great question! Here are the real numbers from my Intel UHD 620 laptop (8GB RAM, no GPU, no dedicated graphics):

Mistral 7B Q4_K_M:

- First query: ~2 min load time (model loading into RAM)
- Subsequent queries: ~2-4 tok/s (model stays loaded via llama-server)

These are CPU-only numbers on a budget laptop from 2018. On better hardware you can expect:

- Intel Core i7/i9 (no GPU): 4-8 tok/s for 7B
- AMD Ryzen 9 (no GPU): 6-10 tok/s for 7B
- Apple M2 MacBook (8GB unified): 15-25 tok/s for 7B
- Apple M3 Pro (18GB unified): 30-40 tok/s for 7B
- NVIDIA RTX 3090 (24GB VRAM): 80-120 tok/s for 7B

For larger models via SSD streaming (mmap):

- Mixtral 8x7B on 32GB RAM: ~3-5 tok/s
- Llama 3 70B on 64GB RAM: ~1-2 tok/s

The lazy expert loading helps most with MoE models like Mixtral: instead of loading all experts, only the 2 active experts per token are kept in RAM, which reduces memory pressure significantly. Would love to hear your numbers if you test it on your setup!
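
For readers unfamiliar with the idea, here is a generic sketch of an LRU cache that keeps only a fixed number of experts resident. It is not the project's code: the class name, the `load_expert` callback and the fake weights are all hypothetical, and a real loader would read tensors from disk.

```python
from collections import OrderedDict
from typing import Callable


class ExpertLRUCache:
    """Keep at most `capacity` experts resident; evict the least recently used one."""

    def __init__(self, capacity: int, load_expert: Callable[[int], bytes]):
        self.capacity = capacity
        self.load_expert = load_expert  # hypothetical loader, e.g. a read from disk
        self._cache = OrderedDict()     # expert_id -> weights, oldest first
        self.misses = 0

    def get(self, expert_id: int) -> bytes:
        if expert_id in self._cache:
            self._cache.move_to_end(expert_id)  # mark as most recently used
            return self._cache[expert_id]
        self.misses += 1
        weights = self.load_expert(expert_id)
        self._cache[expert_id] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)     # drop the least recently used expert
        return weights


cache = ExpertLRUCache(capacity=2, load_expert=lambda i: bytes([i]) * 4)
for expert_id in [0, 1, 0, 2, 1]:
    cache.get(expert_id)
print(cache.misses, list(cache._cache))  # 4 [2, 1]
```

Every miss is a disk read, so the hit rate under real routing patterns decides whether a cache like this helps at all.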

waruby@reddit

It's going to be in seconds per token.

justan0therusername1@reddit

1 token an hour

hesperaux@reddit

Slop post

Ok_Weakness_5253@reddit

Slop user.. boot this nincompoop

hesperaux@reddit

Are you kidding? This post is such obvious ai gen text. And so are the responses from OP.

Ok_Weakness_5253@reddit

What planet are reddit people on man...

No-Anchovies@reddit

Seems legit https://preview.redd.it/tj1h9r598vug1.png?width=800&format=png&auto=webp&s=3c8eb59506a451408989f0bb8e5e489c4f733ace

BigJay125@reddit

I think a more interesting use case would be running the 20-40B parameter models on e.g. an iPhone

aero-spike@reddit

Can someone test it out for me and tell me the speed?

mwallace0569@reddit

It’s painfully slow

KarenBoof@reddit

You can run Claude Mythos on your iPhone 5

aero-spike@reddit

Are you for real or nah?

xeeff@reddit

he's serious. have you been living under a rock?

bapuc@reddit

Working on this too! Have you ever considered embedding the model AND the engine in a single .bin file? With some optimizations you can get up to 4x speedups (I called that thing Flasking, still working on releasing a version of that). It runs faster because it no longer has to do the roundtrips from Python to the model file back and forth.

MrEU1@reddit

Is it possible to try rotorquant in place of TurboQuant?

Everlier@reddit

The post doesn't mention it, but the repo says you also have BitNet in there. Apparently all of these features only needed two commits, one of which was a README update. This drives my slopradar off the charts.

dark_bits@reddit

Oh it’s definitely vibe coded. You can clearly see the overly structured single line comments that add almost no semantic meaning to the code itself

ReasonableRefuse4996@reddit (OP)

Fair point — the commit history is thin because I built this iteratively locally and pushed a clean version at the end. I should have committed more frequently. On BitNet — you're right, it's not implemented yet. It's a roadmap item and I've updated the README to make that clearer. My mistake for not being explicit about that upfront. What's actually working and tested on my machine: llama-server bridge, LRU expert cache, query domain classifier, TurboQuant KV math, GGUF auto-detection, hardware compatibility matrix, and SSE token streaming. I'm a master's student building this to learn. Not a polished production system — but everything listed above genuinely runs on my 8GB Intel UHD laptop. Happy to take issues on the repo if you find specific things that don't work.

PhilosophyforOne@reddit

Something I’m learning reading these posts is that people should provide actual hard evidence to go with their posts.

fugogugo@reddit

Turbo quant is open source?

ReasonableRefuse4996@reddit (OP)

Yes it is open source

xdriver897@reddit

That 1-bit quantization sounds like the model gets lobotomized to the max. FP16 can easily go to 8 as a sweet spot, even 6 or 4 if you have less RAM, but 3 or lower is imho too much loss.

ReasonableRefuse4996@reddit (OP)

Totally valid concern and it's a common misconception worth clarifying. 1-bit here refers to BitNet b1.58 specifically: weights are ternary {-1, 0, +1} rather than just binary {-1, +1}. The key difference is that BitNet models are TRAINED from scratch at 1-bit, not post-quantized from fp16. That's what makes the quality surprisingly competitive. You're absolutely right that aggressively quantizing a normal fp16 model down to 2-3 bit is lobotomy territory. GPTQ/GGUF quantization below Q4 degrades fast. But BitNet trained natively at 1-bit is a different story: Microsoft's research shows near parity with fp16 at 70B+ scale because the model learns to work within the constraint from day one rather than having precision crushed out of it afterward. The practical catch is that true BitNet 70B+ weights don't publicly exist yet. So right now the project uses standard Q4_K_M for actual inference and BitNet is on the roadmap for when those weights drop. Your instinct on Q4 being the sweet spot for standard models is spot on though, that's exactly what I'm running locally.
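
To make the "1.58" and the ternary weights concrete, here is a minimal sketch of absmean rounding to {-1, 0, +1}, which is how I understand the BitNet b1.58 weight quantizer to work. In BitNet it is applied during training; running it once on an already trained fp16 model, as this toy does on a made-up weight list, is exactly the post-hoc crushing described above. The function name is hypothetical.

```python
import math


def absmean_ternary(weights, eps=1e-8):
    """Scale by the mean absolute weight, then round and clip to {-1, 0, +1}."""
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale


# Three possible values per weight carry log2(3) bits of information.
print(f"{math.log2(3):.2f} bits per weight")  # 1.58 bits per weight

q, scale = absmean_ternary([0.9, -0.05, -1.4, 0.3])  # toy weights, not from any model
print(q)  # [1, 0, -1, 0]
```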

Any-Construction6686@reddit

I have a full video on this project, TurboQuant with LazyMoE: [https://youtube.com/shorts/g3CPUCMT0wU?feature=shared](https://youtube.com/shorts/g3CPUCMT0wU?feature=shared)

ForsookComparison@reddit

Do Germans use a lot of emdashes and throwaway accounts

JayPSec@reddit

If you're running loose agents at least you could tell them "No em dashes" on top of the mandatory "Make no mistakes"... tsc tsc tsc. That's true SLOPiness...

SirBraxton@reddit

Depends on how much you're willing to pay. :)

Chromix_@reddit

I should have simply downvoted this and saved time when I saw the dashes in the post, instead of investigating the code. This simply wastes RAM instead of doing anything beneficial. It spawns a regular llama-server that uses mmap, and then manually mmaps "expert shards" (that llama-server does not use) and stores an even more useless [additional copy of that data](https://github.com/patilyashvardhan2002-byte/lazy-moe/blob/78f14702a5d8646af3df5a74736674c9d219eb29/backend/expert_cache.py#L154) in RAM. Even if that would somehow work, the whole thing "works" by statically assigning a small number of experts to load [based on the topic](https://github.com/patilyashvardhan2002-byte/lazy-moe/blob/78f14702a5d8646af3df5a74736674c9d219eb29/backend/query_analyzer.py#L23) of the request - which does not really work with modern MoEs.
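
A toy illustration of why a static, topic-based expert list is a poor fit: in a typical MoE the router picks the top-k experts separately for every token at every layer, from that token's hidden state. The random logits below are only a stand-in for real router outputs, and the function name is made up.

```python
import random


def top_k_experts(router_logits, k):
    """Indices of the k largest router logits, as a typical MoE router selects them."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    return order[:k]


random.seed(0)
num_experts, k = 8, 2
# Stand-in logits: a real model computes them from each token's hidden state at each layer.
for token in ["The", "integral", "of", "x"]:
    logits = [random.gauss(0, 1) for _ in range(num_experts)]
    print(token, top_k_experts(logits, k))
```

Since the chosen experts change from token to token, a set picked once per request will miss most of what the model actually asks for.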

z_latent@reddit

Wow. I too took (wasted) my time making some estimates of how fast that could run, though I was assuming it was a proper implementation lol. What even is the point of this...

> manually mmaps "expert shards" (that llama-server does not use)

Yeah I tried reading into the code but I couldn't find where it actually sent data to llama.cpp. Does it really just, *not* do that? Is it just an attempt at pre-fetching the experts so their pages are already cached by the OS when llama.cpp tries to load them?? At the cost of, indeed, doubling the memory used by the experts lol
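
On the "doubling the memory" point, a small standalone sketch of how that can happen in Python. I have not reproduced the linked code; this only shows the general behavior that slicing an `mmap` object returns a `bytes` copy on the heap, while a `memoryview` slice stays backed by the mapped pages. The scratch file stands in for an expert shard.

```python
import mmap
import os
import tempfile

# Create a small scratch file to map (stand-in for an "expert shard").
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * 4096)
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    base = memoryview(mm)
    view = base[0:1024]     # no copy: still backed by the OS page cache
    copied = mm[0:1024]     # bytes object: a second copy on the Python heap
    view_type = type(view).__name__
    print(view_type, type(copied).__name__, len(copied))
    # Views must be released before the mapping can be closed.
    view.release()
    base.release()
    mm.close()

os.remove(path)
```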

Dany0@reddit

God, a venereal disease would be better. At least if the post spread malware I could see why they tried to waste everyone's time. Now everyone's time is wasted and not even a transfer of wealth into criminal hands occurred. World does not make sense anymore.
