Running a 26B LLM locally with no GPU
Posted by JackStrawWitchita@reddit | LocalLLaMA | View on Reddit | 72 comments
This is crazy. I've been running local LLMs on CPU only for a while now, with great results from 12B models on an i5-8500 with only 32GB of RAM and no GPU. But now I've got a version of Gemma 4 26B running really fast on the same machine, which isn't even breaking a sweat.
It is simply amazing what can run without a GPU.
okyaygokay@reddit
What the hell, how? Is it the iGPU?
Formal-Exam-8767@reddit
The speed is expected, since Gemma 4 26B is a MoE with only 4B active parameters.
An iGPU won't help here, since LLM token generation is memory-bandwidth bound, not compute bound. And that poor iGPU is ill-suited for anything demanding.
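Back-of-envelope (the bandwidth and quant-size numbers below are rough assumptions, not measurements): every generated token has to stream all active weights out of RAM, so decode speed caps out around bandwidth divided by bytes read per token.

```bash
# Rough decode ceilings; 50 GB/s dual-channel DDR4 and the quant sizes are guesses.
awk 'BEGIN {
  bw = 50;                                                  # GB/s
  printf "MoE, ~4B active (~2 GB read/token):  ~%.0f t/s ceiling\n", bw/2;
  printf "Dense 26B at Q3 (~11 GB read/token): ~%.0f t/s ceiling\n", bw/11;
}'
```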
LevianMcBirdo@reddit
A good iGPU still helps a lot in the prompt-processing phase; something like a 780M helps. But yeah, an 8th-gen i5 has a max bandwidth of around 50 GB/s in dual channel, so a full Q8 probably runs well under 10 t/s.
Downtown-Pear-6509@reddit
How can I use the 780M on my 8845HS on Windows with LM Studio? I don't want to have to switch to Linux just for this.
LevianMcBirdo@reddit
You can just activate it under general settings and then choose the GPU/CPU offload per model.
Silver-Champion-4846@reddit
I have UHD 620. Any Intel-friendly quants and inference engines?
Silver-Champion-4846@reddit
I would very much like to know more about this thing you're talking about. I myself have a Core i5-8350U processor with 8 gigabytes of RAM. My laptop, a Dell Latitude 5590, can be upgraded to a maximum of 32 gigabytes of DDR4 RAM. So I am really, really interested in this so-called 26-billion-parameter performance of yours, especially since you have nearly the same CPU as me, the same generation at least, just mine is an ultra-low-power one. Please inform me. I really appreciate it. Thank you.
SlowStopper@reddit
You have 2 DIMM slots, so you can probably upgrade to 64 GB using 2x32 GB sticks. It's not mentioned anywhere because there were no such sticks when the 8th-gen series was released, but it should work (I've tested it on a few systems).
Silver-Champion-4846@reddit
If the BIOS doesn't prevent it, DDR4-2400 32GB sticks exist, and the motherboard doesn't kick my hopes out the window, then sure.
SlowStopper@reddit
The BIOS won't prevent it; the motherboard would have to be specifically designed to block it. Yes, the only problem would be module compatibility, so just buy them somewhere you can return them.
Silver-Champion-4846@reddit
Thanks for the info
JackStrawWitchita@reddit (OP)
I think your 8350U may struggle, as the 8500 is apparently significantly different.
My stack is:
i5-8500
32GB RAM
Linux Mint 22.3
And I run LLMs with KoboldCPP, using the Kobold Lite UI from a browser.
And the LLM is gemma-4-26B-A4B-it-uncensored-heretic-Q3_K_L.gguf from HF.
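If anyone wants to replicate it, the launch is nothing fancy; a minimal sketch (the thread count and context size match what I mention elsewhere in the thread, your paths will differ):

```bash
# Minimal KoboldCPP launch for this setup (sketch; adjust paths and threads)
python koboldcpp.py \
  --model gemma-4-26B-A4B-it-uncensored-heretic-Q3_K_L.gguf \
  --threads 5 \
  --contextsize 8192
# Then open the Kobold Lite UI in a browser at http://localhost:5001
```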
we_are_mammals@reddit
Are there benchmark scores for this quantization? How do they compare to the original?
JackStrawWitchita@reddit (OP)
Here's the card:
https://huggingface.co/llmfan46/gemma-4-26B-A4B-it-uncensored-heretic-GGUF
we_are_mammals@reddit
It looks like they didn't run the benchmarks (MMLU Pro, AIME 2026 no tools, etc.) using the quantizations.
Silver-Champion-4846@reddit
Q3, hmm, that makes sense. How much quality loss?
wowsers7@reddit
Are there any hacks for getting Qwen 3.6 27B running at a decent speed on Windows, CPU only, with 32GB of RAM? I have a fast CPU: an Intel Core Ultra 9 285K. Maybe MTP, Fflash, or PFlash?
Ordynar@reddit
I tested Qwen 3.6 35B A3B on an Intel Core Ultra 7 270K with 6000 MHz CL28 memory.
I got 19 t/s initially; after 22k of context it goes down to 10 t/s.
Prompt processing is quite slow at 50-100 t/s, and with larger context each prompt takes minutes before you see the first token of the response.
GoodTip7897@reddit
That's because Gemma 4 26B is a mixture-of-experts model that only activates 4B parameters per token, so it should be about as fast as a 4B model. Even though Qwen 3.6 27B has just 1B more total parameters, it will run about 8x slower or so, because it is a dense model that activates every parameter.
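A quick sanity check on that ratio, counting weight traffic only (attention and cache overhead ignored):

```bash
awk 'BEGIN { printf "27B dense vs 4B active: ~%.1fx more weight reads per token\n", 27/4 }'
# ~6.8x from weights alone; per-token overhead plausibly rounds it up to ~8x
```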
CooperDK@reddit
Then he should get Qwen3.6-35B-A3B. One billion fewer active parameters, so it should be 25% faster. It isn't, though.
CM0RDuck@reddit
It's a different arch.
CooperDK@reddit
Exactly right. Qwen is trying but slowly failing at both LLMs and image generation. Their image-generation model is far too large for what it can do.
CM0RDuck@reddit
What are you talking about? Qwen3.6 27B is a game changer.
CooperDK@reddit
Not really, compared to the other current models. And it depends on what you use it for.
CM0RDuck@reddit
Sure, bud. A 27B dense model outperforming models ten times its size is no biggie.
It's easy to hold opinions when you're being as vague as you are. Honestly, it's probably just user error on your end.
KURD_1_STAN@reddit
What 270B dense model is Qwen 27B beating? I know there aren't any such new models anymore. But let's be real: it's good, but not 70B-dense good, let alone 270B. You'd better not mention Mistral.
CM0RDuck@reddit
Benchmarks across many, many domains of expertise are completely FREE online. Knock yourself out. Or don't, and stay behind; I don't care. 👍
GoodTip7897@reddit
Yeah, if Gemma were gated delta net, then the A4B would be slower than Qwen 35B and the 31B would be slower than Qwen 27B.
GoodTip7897@reddit
It's roughly proportional when you go from 4B to 27B, but not so much for smaller sizes... Also, I think gated delta net kernels aren't as optimized as traditional iSWA. It's not like any inference software reaches much more than 90% of theoretical bandwidth anyway.
I do get higher effective bandwidth with Gemma 4 than with Qwen (the 31B runs faster than the 27B).
KURD_1_STAN@reddit
Qwen has 256 experts vs Gemma's 128. Qwen needs more offloading and loading if it doesn't fit in RAM. That shouldn't be the case for CPU-only inference, since you don't need that, although I'm not sure how that works with the cache.
Wonderful-Pie-4940@reddit
Gemma 4 is a MoE model, and most probably you are running the A4B variant, which basically means that at inference time only 4B params are active.
DigitalguyCH@reddit
what speed is your RAM?
JackStrawWitchita@reddit (OP)
(base) dell@dell-OptiPlex-3060:~$ inxi -
Memory:
System RAM: total: 32 GiB available: 31.17 GiB used: 4.45 GiB (14.3%)
Array-1: capacity: 32 GiB slots: 2 modules: 2 EC: None
Device-1: DIMM1 type: DDR4 size: 16 GiB speed: 2400 MT/s
Device-2: DIMM2 type: DDR4 size: 16 GiB speed: 2400 MT/s
DigitalguyCH@reddit
Wow, it's not even DDR5... I struggle to get 3-4 t/s on my DDR5 system with over double the speed. But I guess it also depends on the prompt and context; maybe you should post an example for us to compare against.
VoiceApprehensive893@reddit
20 tokens/second on DDR5 RAM is really nice, especially since this model can actually do a lot of the things people use LLMs for.
APFrisco@reddit
Out of curiosity, what do you use the models you run on your CPU for? Experimentation, or something else?
I really like CPU inference; it's such a great way to run models that wouldn't fit fully on my GPU.
JackStrawWitchita@reddit (OP)
Experimentation, just to learn the basics of this stuff.
I was looking to buy a GPU, saw the prices and couldn't really justify it. So I started messing around with smaller models on my old gear and here we are.
APFrisco@reddit
Nice, yeah good plan!
pmttyji@reddit
Of course MoE models (small/medium ones particularly) can run at decent speed with CPU-only inference. I posted a thread on this in the past, which covers both MoE and dense models:
CPU-only LLM performance - t/s with llama.cpp
SethMatrix@reddit
X
JackStrawWitchita@reddit (OP)
How much did you spend on your GPU?
SethMatrix@reddit
The server has my old GPU, an RTX 3080 from Facebook Marketplace for $400.
SettingAgile9080@reddit
Haven't tried CPU inference for a while, and back then (6 months ago?) it was painfully slow, so it's interesting to see these MoE models running sort-of acceptably well on CPU.
I did a full bench sweep (custom self-improving script generated with Claude Code/Opus 4.7) on Gemma 4 26B-A4B Q4_K_XL on an i7-14700K + 96GB DDR5, CPU only, via the llama.cpp Docker image.
Real-world server numbers (warmed up, ~200 tok prompt → 300 tok gen):
Prompt Processing (PP): ~90 tok/s
Token Generation (TG): ~13 tok/s
Bench notes:
- `--threads 8 --threads-batch 28` in llama-server gets you both peaks from the same process. Setting threads=8 for everything caps PP at ~73; threads=28 for everything tanks TG to ~11. For short interactive prompts, forcing everything to P-cores might be worthwhile, but it isn't for longer or background tasks.
- `docker --cpuset-cpus=0-15` to force everything onto P-cores looked great in the synthetic bench (80 PP / 14.5 TG), but in real serving PP collapsed to 44 tok/s: OpenMP HT contention under HTTP + sampling load. So llama-bench numbers don't always translate to live serving.
- `mmap` on/off and `ubatch` 256/512/1024 were within noise (<256 hurts); flash attention gave ~+2%. KV cache: stick with f16 unless you also use `-fa 1` (q8_0/q4_0 KV refuse to load without flash-attn).
- In `btop` (beautiful TUI) this didn't seem to max out my full CPU, just individual cores. Surprisingly, there wasn't much temperature spiking.
- For OP on i5-8500 + DDR4: expect roughly half the TG (~6-7 tok/s), since dual-channel DDR4 is ~40 GB/s vs DDR5's ~80, and PP will be lower again because of fewer cores. You'd need ~22GB to load this into memory with 128K context. Still very usable for an "almost-27B" model.
My GPU isn't the hottest, so I'm wondering if this would be a good way to run a second (or more) model at the same time, so that background batch jobs where I don't care about speed can use the CPU during the hours when I'm actively using the GPU.
Here's my serve-cpu.sh. It's tuned for my CPU, so it might need tweaking for other setups:
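In outline it's just a llama-server wrapper applying the PP/TG thread split from the bench notes above; treat this as a minimal sketch, since the model path, context size, host, and port are placeholders:

```bash
#!/usr/bin/env bash
# serve-cpu.sh (sketch) -- CPU-only llama.cpp server using the thread split
# described in the bench notes. MODEL, context size, and port are placeholders.
MODEL="$HOME/models/gemma-4-26B-A4B-Q4_K_XL.gguf"

llama-server \
  -m "$MODEL" \
  --threads 8 \
  --threads-batch 28 \
  -fa 1 \
  -c 131072 \
  --host 127.0.0.1 \
  --port 8080

# --threads 8 pins token generation to the P-cores (peak TG),
# --threads-batch 28 lets prompt processing use all cores (peak PP),
# -fa 1 enables flash attention (needed for quantized KV cache).
```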
CarlosEduardoAraujo@reddit
tok/sec??
JackStrawWitchita@reddit (OP)
Using Koboldcpp on Linux, I entered: "The car wash is 100 meters from my house. Should I walk to the car wash or drive there?"
Processing Prompt (30 / 30 tokens)
Generating (45 / 1024 tokens)
(Stop sequence triggered: User:)
[14:39:05] CtxLimit:150/8192, Init:0.00s, Processed:30 in 1.30s (23.13T/s), Generated:45/1024 in 4.86s (9.25T/s), Total:6.16s
Output: If you're just going for a quick car wash, walking might be easier and more environmentally friendly. However, if you have a lot of heavy equipment or are in a hurry, driving might be better.
CarlosEduardoAraujo@reddit
Can you try this prompt:
Irei fazer uma corrida de endurance, o tempo total é de 3 horas. O tempo de cada volta é de 1 minuto e 41 segundos. Com um tanque de gasolina, que tem capacidade de 100 litros, é possível fazer 36 voltas. Quantos litros irei precisar para fazer as 3 horas de corrida?
(Translation: I'm going to run an endurance race; the total time is 3 hours. Each lap takes 1 minute and 41 seconds. With one tank of fuel, which holds 100 liters, it's possible to do 36 laps. How many liters will I need for the 3 hours of racing?)
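For reference, the arithmetic the prompt asks for (assuming only complete laps count):

```bash
awk 'BEGIN {
  laps   = int((3 * 3600) / 101);   # 1:41 = 101 s per lap -> 106 complete laps
  litres = laps * (100.0 / 36);     # 100 L tank lasts 36 laps -> ~2.78 L/lap
  printf "%d laps, ~%.0f litres\n", laps, litres;
}'
# -> 106 laps, ~294 litres
```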
NightCulex@reddit
I love questions like this.
Maleficent-Ad5999@reddit
Please share t/s
Sooperooser@reddit
You can expect low double-digit t/s for MoE 25-35B quantized models and low single-digit t/s for dense models of that size running on a consumer CPU only. I got about 16 t/s running Qwen3.6 35B A3B Q4-Q6 quants on an R7 3700X with 64GB of dual-channel DDR4 RAM and GPU use switched off.
KURD_1_STAN@reddit
Which 27B quant? I'm also getting single digits with a 3060 12GB, even at Q3.
peligroso@reddit
The 3060 is anemic at inference speed even compared to a 2080/3070. It's gonna be hard to get a good setup with that driving your CUDA cores.
KURD_1_STAN@reddit
~350 GB/s compared to, what, like 50? It shouldn't be this close to a CPU.
peligroso@reddit
whoosh
Sooperooser@reddit
Q4_K_M, iirc.
JackStrawWitchita@reddit (OP)
You answered it better than I could, but all I know is that the 26B Gemma 4 runs faster than the 12B dense LLMs I usually use, and the CPU isn't even breaking a sweat.
From a user perspective, after the initial query, subsequent queries respond near-instantaneously.
JackStrawWitchita@reddit (OP)
23.13T/s
LetsGoBrandon4256@reddit
23.13T/s is the prompt-processing speed; the actual generation speed is 9.25T/s.
Still pretty usable for pure CPU inference on an i5 with DDR4.
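Both figures come straight from the counts and timings in OP's log line (the small difference is rounding):

```bash
awk 'BEGIN { printf "PP: 30/1.30 = %.2f T/s   TG: 45/4.86 = %.2f T/s\n", 30/1.30, 45/4.86 }'
# -> PP ~23.08 T/s (the log shows 23.13), TG ~9.26 T/s
```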
JackStrawWitchita@reddit (OP)
Thanks for the clarity!
ArchdukeofHyperbole@reddit
Yeah, I think it's amazing too. MoE models are much more CPU-friendly, like night and day compared to dense models. A dense 7B is barely usable on my PC and causes it to lag. With MoE, basically just pick a model with 2B or 3B active parameters and you can get by. Even if it's a bit slower than using online models, it's incredible to have access to offline intelligence at all.
When I started getting into LLMs, I really wanted to use Llama 70B, but even at a Q1 quant it didn't really work. Qwen Next and others are faster and smarter than the models I initially wanted to run, and I didn't have to buy hardware; I just waited for LLM efficiency gains, basically.
BitGreen1270@reddit
I'm surprised you're getting 23 t/s. I have a 32GB-RAM Ryzen 7 250 with a 780M iGPU and I'm getting roughly 18-20 t/s, and I see GPU usage go up. So how come yours is about the same? Does your system become less responsive when the LLM is running?
LetsGoBrandon4256@reddit
OP confused the prefill speed with the generation speed. It's actually 9T/s
BitGreen1270@reddit
Ah that makes sense, thanks for sharing. I thought maybe I was running only on CPU 😄.
JackStrawWitchita@reddit (OP)
I've given KoboldCPP 5 of my 6 threads, so the LLM runs and I can still use a browser and other stuff with no issues while the LLM chugs away.
BitGreen1270@reddit
Ah okay - that's smart. I'm glad you breathed new life into an older system. Now you can try tweaking it to squeeze as much performance as you can out of it.
Successful_Plant2759@reddit
The useful distinction here is total params versus active params, plus memory bandwidth. If Gemma 4 26B is MoE and only lights up a small slice per token, CPU-only can feel much better than a dense 26B. That is why tokens/sec, quant level, RAM speed, and batch size matter more than the headline parameter count. Would be great to see those numbers so people do not overgeneralize from this to every 26B model.
Bulky-Priority6824@reddit
Yeah, I'm sure. Don't ever stop.
Hofi2010@reddit
I can't believe that. Can you share a repo with how everything is set up, so we can verify your results? Some more performance metrics, like t/s and context window, would also be good to know.
JackStrawWitchita@reddit (OP)
There's no repo to set up:
I've got an old Dell OptiPlex with an i5-8500 and 32GB of RAM, running Linux Mint 22.3 and using KoboldCPP with the Kobold Lite front end. I loaded gemma-4-26B-A4B-it-uncensored-heretic-Q3_K_L.gguf from HF and fired it off.
I entered: "The car wash is 100 meters from my house. Should I walk to the car wash or drive there?"
Processing Prompt (30 / 30 tokens)
Generating (45 / 1024 tokens)
(Stop sequence triggered: User:)
[14:39:05] CtxLimit:150/8192, Init:0.00s, Processed:30 in 1.30s (23.13T/s), Generated:45/1024 in 4.86s (9.25T/s), Total:6.16s
Output: If you're just going for a quick car wash, walking might be easier and more environmentally friendly. However, if you have a lot of heavy equipment or are in a hurry, driving might be better.
Silver-Champion-4846@reddit
Yeah leaving us hanging like that is....
CooperDK@reddit
You mean Gemma 4 26B-A4B: 4B active parameters. But you are pushing it; loading it even in Q4 takes more than half of your RAM, and that's not even counting the operating system. So it's chewing through your virtual memory too.
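Rough GGUF sizing for a 26B model, using approximate bits-per-weight for the K-quants (the bpw figures are rough assumptions):

```bash
awk 'BEGIN {
  printf "Q3_K_L: ~%.1f GB\n", 26e9 * 3.4 / 8 / 1e9;   # ~3.4 bpw (approx.)
  printf "Q4_K_M: ~%.1f GB\n", 26e9 * 4.8 / 8 / 1e9;   # ~4.8 bpw (approx.)
}'
# ~11 GB at Q3 vs ~16 GB at Q4 -- over half of 32 GB once the OS and KV cache are counted
```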
Queasy-Contract9753@reddit
It's pretty cool how far we've come, huh? I use Gemma 4 often on Google's API free tier. I might go local if I can even buy more RAM. It's definitely smarter than the first ChatGPT.
Back then, when they said we'd have local GPT-3-level LLMs one day, I thought it was bullshit. Can't wait to see what's around the corner next.
cosmos_hu@reddit
Sounds nice, imma test it later too :D