Running a 26B LLM locally with no GPU
Posted by JackStrawWitchita@reddit | LocalLLaMA | View on Reddit | 72 comments
This is crazy. I've been running local LLMs on CPU only for a while now, with great results from 12B models on an i5-8500 with only 32GB of RAM and no GPU. But now I've got a version of Gemma 4 26B running really fast on the same machine, which isn't even breaking a sweat.
It is simply amazing what can run without a GPU.
okyaygokay@reddit
What the hell, how? Is it the iGPU?
Formal-Exam-8767@reddit
The speed is expected, since Gemma 4 26B is a MoE with only 4B active parameters.
An iGPU won't help here, since LLM token generation is memory-bandwidth bound, not compute bound. And that poor iGPU is ill-suited for anything demanding.
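Back-of-envelope (the bandwidth and quant-size numbers below are rough assumptions, not measurements): every generated token has to stream all active weights out of RAM, so decode speed caps out around bandwidth divided by bytes read per token.

```bash
# Rough decode ceilings; 50 GB/s dual-channel DDR4 and the quant sizes are guesses.
awk 'BEGIN {
  bw = 50;                                                  # GB/s
  printf "MoE, ~4B active (~2 GB read/token):  ~%.0f t/s ceiling\n", bw/2;
  printf "Dense 26B at Q3 (~11 GB read/token): ~%.0f t/s ceiling\n", bw/11;
}'
```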
LevianMcBirdo@reddit
A good iGPU still helps a lot in the prompt-processing phase; something like a 780M helps. But yeah, an 8th-gen i5 has a max bandwidth of around 50 GB/s in dual channel, so a full Q8 probably runs well under 10 t/s.
Downtown-Pear-6509@reddit
How can I use the 780M on my 8845HS on Windows with LM Studio? I don't want to have to switch to Linux just for this.
LevianMcBirdo@reddit
You can just activate it under general settings and then choose the GPU/CPU offload per model.
Silver-Champion-4846@reddit
I have UHD 620. Any Intel-friendly quants and inference engines?
Silver-Champion-4846@reddit
I would very much like to know more about this thing you're talking about. I myself have a Core i5-8350U processor with 8 gigabytes of RAM. My laptop, a Dell Latitude 5590, can be upgraded to a maximum of 32 gigabytes of DDR4 RAM. So I am really, really interested in this so-called 26-billion-parameter performance of yours, especially since you have nearly the same CPU as me, the same generation at least, just mine is an ultra-low-power one. Please inform me. I really appreciate it. Thank you.
SlowStopper@reddit
You have 2 DIMM slots, so you can probably upgrade to 64 GB using 2x32 GB sticks. It's not mentioned anywhere because there were no such sticks when the 8th-gen series was released, but it should work (I've tested it on a few systems).
Silver-Champion-4846@reddit
If the BIOS doesn't prevent it, DDR4-2400 32GB sticks exist, and the motherboard doesn't kick my hopes out the window, then sure.
SlowStopper@reddit
The BIOS won't prevent it; the motherboard would have to be specifically designed to block it. Yes, the only problem would be module compatibility, so just buy them somewhere you can return them.
Silver-Champion-4846@reddit
Thanks for the info
JackStrawWitchita@reddit (OP)
I think your 8350U may struggle, as the 8500 is apparently significantly different.
My stack is:
i5-8500
32GB RAM
Linux Mint 22.3
And I run LLMs with KoboldCPP, using the Kobold Lite UI from a browser.
And the LLM is gemma-4-26B-A4B-it-uncensored-heretic-Q3_K_L.gguf from HF.
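If anyone wants to replicate it, the launch is nothing fancy; a minimal sketch (the thread count and context size match what I mention elsewhere in the thread, your paths will differ):

```bash
# Minimal KoboldCPP launch for this setup (sketch; adjust paths and threads)
python koboldcpp.py \
  --model gemma-4-26B-A4B-it-uncensored-heretic-Q3_K_L.gguf \
  --threads 5 \
  --contextsize 8192
# Then open the Kobold Lite UI in a browser at http://localhost:5001
```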
we_are_mammals@reddit
Are there benchmark scores for this quantization? How do they compare to the original?
JackStrawWitchita@reddit (OP)
Here's the card:
https://huggingface.co/llmfan46/gemma-4-26B-A4B-it-uncensored-heretic-GGUF
we_are_mammals@reddit
It looks like they didn't run the benchmarks (MMLU Pro, AIME 2026 no tools, etc.) using the quantizations.
Silver-Champion-4846@reddit
Q3, hmm, that makes sense. How much quality loss?
wowsers7@reddit
Are there any hacks for getting Qwen 3.6 27B running at a decent speed on Windows, CPU only, with 32GB of RAM? I have a fast CPU: an Intel Core Ultra 9 285K. Maybe MTP, Fflash, or PFlash?
Ordynar@reddit
I tested Qwen 3.6 35B A3B on an Intel Core Ultra 7 270K with 6000 MHz CL28 memory.
I got 19 t/s initially; after 22k of context it goes down to 10 t/s.
Prompt processing is quite slow at 50-100 t/s, and with larger context each prompt takes minutes before you see the first token of the response.
GoodTip7897@reddit
That's because Gemma 4 26B is a mixture-of-experts model that only activates 4B parameters per token, so it should be about as fast as a 4B model. Even though Qwen 3.6 27B has just 1B more total parameters, it will run about 8x slower or so, because it is a dense model that activates every parameter.
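A quick sanity check on that ratio, counting weight traffic only (attention and cache overhead ignored):

```bash
awk 'BEGIN { printf "27B dense vs 4B active: ~%.1fx more weight reads per token\n", 27/4 }'
# ~6.8x from weights alone; per-token overhead plausibly rounds it up to ~8x
```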
CooperDK@reddit
Then he should get Qwen3.6-35B-A3B. One billion fewer active parameters, so it should be 25% faster. It isn't, though.
CM0RDuck@reddit
It's a different arch.
CooperDK@reddit
Exactly right. Qwen is trying but slowly failing at both LLMs and image generation. Their image-generation model is far too large for what it can do.
CM0RDuck@reddit
What are you talking about? Qwen3.6 27B is a game changer.
CooperDK@reddit
Not really, compared to the other current models. And it depends on what you use it for.
CM0RDuck@reddit
Sure, bud. A 27B dense model outperforming models ten times its size is no biggie.
It's easy to hold opinions when you're being as vague as you are. Honestly, it's probably just user error on your end.
KURD_1_STAN@reddit
What 270B dense model is Qwen 27B beating? I know there aren't any such new models anymore. But let's be real: it's good, but not 70B-dense good, let alone 270B. You'd better not mention Mistral.
CM0RDuck@reddit
Benchmarks across many, many domains of expertise are completely FREE online. Knock yourself out. Or don't, and stay behind; I don't care. 👍
GoodTip7897@reddit
Yeah, if Gemma were gated delta net, then the A4B would be slower than Qwen 35B and the 31B would be slower than Qwen 27B.
GoodTip7897@reddit
It's roughly proportional when you go from 4B to 27B, but not so much for smaller sizes... Also, I think gated delta net kernels aren't as optimized as traditional iSWA. It's not like any inference software reaches much more than 90% of theoretical bandwidth anyway.
I do get higher effective bandwidth with Gemma 4 than with Qwen (the 31B runs faster than the 27B).
KURD_1_STAN@reddit
Qwen has 256 experts vs Gemma's 128. Qwen needs more offloading and loading if it doesn't fit in RAM. That shouldn't be the case for CPU-only inference, since you don't need that, although I'm not sure how that works with the cache.
Wonderful-Pie-4940@reddit
Gemma 4 is a MoE model, and most probably you are running the A4B variant, which basically means that at inference time only 4B params are active.
DigitalguyCH@reddit
what speed is your RAM?
JackStrawWitchita@reddit (OP)
(base) dell@dell-OptiPlex-3060:~$ inxi -
Memory:
System RAM: total: 32 GiB available: 31.17 GiB used: 4.45 GiB (14.3%)
Array-1: capacity: 32 GiB slots: 2 modules: 2 EC: None
Device-1: DIMM1 type: DDR4 size: 16 GiB speed: 2400 MT/s
Device-2: DIMM2 type: DDR4 size: 16 GiB speed: 2400 MT/s
DigitalguyCH@reddit
Wow, it's not even DDR5... I struggle to get 3-4 t/s on my DDR5 system with over double the speed. But I guess it also depends on the prompt and context; maybe you should post an example for us to compare against.
VoiceApprehensive893@reddit
20 tokens/second on DDR5 RAM is really nice, especially since this model can actually do a lot of the things people use LLMs for.
APFrisco@reddit
Out of curiosity, what do you use the models you run on your CPU for? Experimentation, or something else?
I really like CPU inference; it's such a great way to run models that wouldn't fit fully on my GPU.
JackStrawWitchita@reddit (OP)
Experimentation, just to learn the basics of this stuff.
I was looking to buy a GPU, saw the prices and couldn't really justify it. So I started messing around with smaller models on my old gear and here we are.
APFrisco@reddit
Nice, yeah good plan!
pmttyji@reddit
Of course MoE models (small/medium ones particularly) can run at decent speed with CPU-only inference. I posted a thread on this in the past, which covers both MoE and dense models:
CPU-only LLM performance - t/s with llama.cpp
SethMatrix@reddit
X
JackStrawWitchita@reddit (OP)
How much did you spend on your GPU?
SethMatrix@reddit
The server has my old GPU, an RTX 3080 from Facebook Marketplace for $400.
SettingAgile9080@reddit
Haven't tried CPU inference for a while, and back then (6 months ago?) it was painfully slow, so it's interesting to see these MoE models running sort-of acceptably well on CPU.
I did a full bench sweep (custom self-improving script generated with Claude Code/Opus 4.7) on Gemma 4 26B-A4B Q4_K_XL on an i7-14700K + 96GB DDR5, CPU only, via the llama.cpp Docker image.
Real-world server numbers (warmed up, ~200 tok prompt → 300 tok gen):
Prompt Processing (PP): ~90 tok/s
Token Generation (TG): ~13 tok/s
Bench notes:
- `--threads 8 --threads-batch 28` in llama-server gets you both peaks from the same process. Setting threads=8 for everything caps PP at ~73; threads=28 for everything tanks TG to ~11. For short interactive prompts, forcing everything to P-cores might be worthwhile, but it isn't for longer or background tasks.
- `docker --cpuset-cpus=0-15` to force everything onto P-cores looked great in the synthetic bench (80 PP / 14.5 TG), but in real serving PP collapsed to 44 tok/s: OpenMP HT contention under HTTP + sampling load. So llama-bench numbers don't always translate to live serving.
- `mmap` on/off and `ubatch` 256/512/1024 were within noise (<256 hurts); flash attention gave ~+2%. KV cache: stick with f16 unless you also use `-fa 1` (q8_0/q4_0 KV refuse to load without flash-attn).
- In `btop` (beautiful TUI) this didn't seem to max out my full CPU, just individual cores. Surprisingly, there wasn't much temperature spiking.
- For OP on i5-8500 + DDR4: expect roughly half the TG (~6-7 tok/s), since dual-channel DDR4 is ~40 GB/s vs DDR5's ~80, and PP will be lower again because of fewer cores. You'd need ~22GB to load this into memory with 128K context. Still very usable for an "almost-27B" model.
My GPU isn't the hottest, so I'm wondering if this would be a good way to run a second (or more) model at the same time, so that background batch jobs where I don't care about speed can use the CPU during the hours when I'm actively using the GPU.
Here's my serve-cpu.sh. It's tuned for my CPU, so it might need tweaking for other setups:
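In outline it's just a llama-server wrapper applying the PP/TG thread split from the bench notes above; treat this as a minimal sketch, since the model path, context size, host, and port are placeholders:

```bash
#!/usr/bin/env bash
# serve-cpu.sh (sketch) -- CPU-only llama.cpp server using the thread split
# described in the bench notes. MODEL, context size, and port are placeholders.
MODEL="$HOME/models/gemma-4-26B-A4B-Q4_K_XL.gguf"

llama-server \
  -m "$MODEL" \
  --threads 8 \
  --threads-batch 28 \
  -fa 1 \
  -c 131072 \
  --host 127.0.0.1 \
  --port 8080

# --threads 8 pins token generation to the P-cores (peak TG),
# --threads-batch 28 lets prompt processing use all cores (peak PP),
# -fa 1 enables flash attention (needed for quantized KV cache).
```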
CarlosEduardoAraujo@reddit
tok/sec??
JackStrawWitchita@reddit (OP)
Using Koboldcpp on Linux, I entered: "The car wash is 100 meters from my house. Should I walk to the car wash or drive there?"
Processing Prompt (30 / 30 tokens)
Generating (45 / 1024 tokens)
(Stop sequence triggered: User:)
[14:39:05] CtxLimit:150/8192, Init:0.00s, Processed:30 in 1.30s (23.13T/s), Generated:45/1024 in 4.86s (9.25T/s), Total:6.16s
Output: If you're just going for a quick car wash, walking might be easier and more environmentally friendly. However, if you have a lot of heavy equipment or are in a hurry, driving might be better.
CarlosEduardoAraujo@reddit
Can you try this prompt:
Irei fazer uma corrida de endurance, o tempo total é de 3 horas. O tempo de cada volta é de 1 minuto e 41 segundos. Com um tanque de gasolina, que tem capacidade de 100 litros, é possível fazer 36 voltas. Quantos litros irei precisar para fazer as 3 horas de corrida?
(Translation: I'm going to run an endurance race; the total time is 3 hours. Each lap takes 1 minute and 41 seconds. With one tank of fuel, which holds 100 liters, it's possible to do 36 laps. How many liters will I need for the 3 hours of racing?)
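For reference, the arithmetic the prompt asks for (assuming only complete laps count):

```bash
awk 'BEGIN {
  laps   = int((3 * 3600) / 101);   # 1:41 = 101 s per lap -> 106 complete laps
  litres = laps * (100.0 / 36);     # 100 L tank lasts 36 laps -> ~2.78 L/lap
  printf "%d laps, ~%.0f litres\n", laps, litres;
}'
# -> 106 laps, ~294 litres
```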
NightCulex@reddit
I love questions like this.
Maleficent-Ad5999@reddit
Please share t/s
Sooperooser@reddit
You can expect low double-digit t/s for MoE 25-35B quantized models and low single-digit t/s for dense models of that size running on a consumer CPU only. I got about 16 t/s running Qwen3.6 35B A3B Q4-Q6 quants on an R7 3700X with 64GB of dual-channel DDR4 RAM and GPU use switched off.
KURD_1_STAN@reddit
Which 27B quant? I'm also getting single digits with a 3060 12GB, even at Q3.
peligroso@reddit
The 3060 is anemic at inference speed even compared to a 2080/3070. It's gonna be hard to get a good setup with that driving your CUDA cores.
KURD_1_STAN@reddit
~350 GB/s compared to, what, like 50? It shouldn't be this close to a CPU.
peligroso@reddit
whoosh
Sooperooser@reddit
Q4_K_M, iirc.
JackStrawWitchita@reddit (OP)
You answered it better than I could, but all I know is that the 26B Gemma 4 runs faster than the 12B dense LLMs I usually use, and the CPU isn't even breaking a sweat.
From a user perspective, after the initial query, subsequent queries respond near-instantaneously.
JackStrawWitchita@reddit (OP)
23.13T/s
LetsGoBrandon4256@reddit
23.13T/s is the prompt-processing speed; the actual generation speed is 9.25T/s.
Still pretty usable for pure CPU inference on an i5 with DDR4.
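Both figures come straight from the counts and timings in OP's log line (the small difference is rounding):

```bash
awk 'BEGIN { printf "PP: 30/1.30 = %.2f T/s   TG: 45/4.86 = %.2f T/s\n", 30/1.30, 45/4.86 }'
# -> PP ~23.08 T/s (the log shows 23.13), TG ~9.26 T/s
```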
JackStrawWitchita@reddit (OP)
Thanks for the clarity!
ArchdukeofHyperbole@reddit
Yeah, I think it's amazing too. MoE models are much more CPU-friendly, like night and day compared to dense models. A dense 7B is barely usable on my PC and causes it to lag. With MoE, basically just pick a model with 2B or 3B active parameters and you can get by. Even if it's a bit slower than using online models, it's incredible to have access to offline intelligence at all.
When I started getting into LLMs, I really wanted to use Llama 70B, but even at a Q1 quant it didn't really work. Qwen Next and others are faster and smarter than the models I initially wanted to run, and I didn't have to buy hardware; I just waited for LLM efficiency gains, basically.
BitGreen1270@reddit
I'm surprised you're getting 23 t/s. I have a 32GB-RAM Ryzen 7 250 with a 780M iGPU and I'm getting roughly 18-20 t/s, and I see GPU usage go up. So how come yours is about the same? Does your system become less responsive when the LLM is running?
LetsGoBrandon4256@reddit
OP confused the prefill speed with the generation speed. It's actually 9T/s
BitGreen1270@reddit
Ah that makes sense, thanks for sharing. I thought maybe I was running only on CPU 😄.
JackStrawWitchita@reddit (OP)
I've given KoboldCPP 5 of my 6 threads, so the LLM runs and I can still use a browser and other stuff with no issues while the LLM chugs away.
BitGreen1270@reddit
Ah okay - that's smart. I'm glad you breathed new life into an older system. Now you can try tweaking it to squeeze as much performance as you can out of it.
Successful_Plant2759@reddit
The useful distinction here is total params versus active params, plus memory bandwidth. If Gemma 4 26B is MoE and only lights up a small slice per token, CPU-only can feel much better than a dense 26B. That is why tokens/sec, quant level, RAM speed, and batch size matter more than the headline parameter count. Would be great to see those numbers so people do not overgeneralize from this to every 26B model.
Bulky-Priority6824@reddit
Yeah, I'm sure. Don't ever stop.
Hofi2010@reddit
I can't believe that. Can you share a repo with how everything is set up, so we can verify your results? Some more performance metrics, like t/s and context window, would also be good to know.
JackStrawWitchita@reddit (OP)
There's no repo to set up:
I've got an old Dell OptiPlex with an i5-8500 and 32GB of RAM, running Linux Mint 22.3 and using KoboldCPP with the Kobold Lite front end. I loaded gemma-4-26B-A4B-it-uncensored-heretic-Q3_K_L.gguf from HF and fired it off.
I entered: "The car wash is 100 meters from my house. Should I walk to the car wash or drive there?"
Processing Prompt (30 / 30 tokens)
Generating (45 / 1024 tokens)
(Stop sequence triggered: User:)
[14:39:05] CtxLimit:150/8192, Init:0.00s, Processed:30 in 1.30s (23.13T/s), Generated:45/1024 in 4.86s (9.25T/s), Total:6.16s
Output: If you're just going for a quick car wash, walking might be easier and more environmentally friendly. However, if you have a lot of heavy equipment or are in a hurry, driving might be better.
Silver-Champion-4846@reddit
Yeah leaving us hanging like that is....
CooperDK@reddit
You mean Gemma 4 26B-A4B: 4B active parameters. But you are pushing it; loading it even in Q4 takes more than half of your RAM, and that's not even counting the operating system. So it's chewing through your virtual memory too.
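Rough GGUF sizing for a 26B model, using approximate bits-per-weight for the K-quants (the bpw figures are rough assumptions):

```bash
awk 'BEGIN {
  printf "Q3_K_L: ~%.1f GB\n", 26e9 * 3.4 / 8 / 1e9;   # ~3.4 bpw (approx.)
  printf "Q4_K_M: ~%.1f GB\n", 26e9 * 4.8 / 8 / 1e9;   # ~4.8 bpw (approx.)
}'
# ~11 GB at Q3 vs ~16 GB at Q4 -- over half of 32 GB once the OS and KV cache are counted
```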
Queasy-Contract9753@reddit
It's pretty cool how far we've come, huh? I use Gemma 4 often on Google's API free tier. I might go local if I can even buy more RAM. It's definitely smarter than the first ChatGPT.
Back then, when they said we'd have local GPT-3-level LLMs one day, I thought it was bullshit. Can't wait to see what's around the corner next.
cosmos_hu@reddit
Sounds nice, imma test it later too :D