Would a fully open SmolLM4-750M with 16K context make sense?
Posted by Ok-Type-7663@reddit | LocalLLaMA | 14 comments
I’ve been thinking about a possible gap in the current small local model space: a modern, fully open ~750M model.
Hugging Face already has SmolLM2 at 135M, 360M, and 1.7B, and SmolLM3 pushes the family to 3B with long context, multilingual support, and reasoning. The Smol Models repo also describes the goal pretty clearly: fully open, compact models that can run effectively on-device while still having strong performance.
So my idea is:
SmolLM4-750M
High-level target:
- ~750M parameters
- 16K context
- Causal LM
- Fully open weights
- Fully open data recipe
- Training/eval details public
- Apache-2.0 if possible
- Main languages: English + Spanish
- Built for local inference, weak hardware, students, hobbyists, and small-device experiments
I’m intentionally not suggesting exact architecture internals like layer count, FFN size, attention heads, RoPE settings, etc. Hugging Face would know better how to design that. I’m more interested in whether the size class itself makes sense.
Why 750M?
To me, it feels like a missing middle point:
- 135M / 360M are cool but often too limited
- 1.7B is much better but heavier
- 3B is already a different class for weak machines
- ~750M could be a sweet spot for low-RAM CPU inference, fast testing, small fine-tunes, education, and “actually usable but still tiny” local workflows
Possible dataset direction (a rough loading sketch follows this list):
- HuggingFaceTB/smollm-corpus
- HuggingFaceFW/fineweb-edu
- HuggingFaceTB/finemath
- HuggingFaceTB/stack-edu
- HuggingFaceTB/smoltalk2
- HuggingFaceTB/cosmopedia
- HuggingFaceFW/fineweb-2, Spanish subset spa_Latn
- open-thoughts/OpenThoughts-114k
- HuggingFaceTB/smol-smoltalk
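As a very rough illustration (not a vetted recipe), a streaming English+Spanish mix over a few of these datasets could be assembled with the `datasets` library along these lines; the subset names and mixing weights here are placeholder assumptions, not tested values:

```python
from datasets import load_dataset, interleave_datasets

# Sketch of a weighted EN/ES pre-training stream. Config names ("sample-10BT",
# "finemath-4plus", "spa_Latn") and the probabilities are illustrative choices,
# not a validated SmolLM-style data recipe.
en_edu = load_dataset("HuggingFaceFW/fineweb-edu", "sample-10BT",
                      split="train", streaming=True)
es_web = load_dataset("HuggingFaceFW/fineweb-2", "spa_Latn",
                      split="train", streaming=True)
math = load_dataset("HuggingFaceTB/finemath", "finemath-4plus",
                    split="train", streaming=True)

mix = interleave_datasets(
    [en_edu, es_web, math],
    probabilities=[0.7, 0.2, 0.1],  # mostly English edu web, plus Spanish and math
    seed=42,
)

# Peek at a few interleaved documents.
for i, example in enumerate(mix):
    print(example["text"][:120])
    if i == 2:
        break
```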
The goal would not be to beat 3B models. The goal would be a very clean, open, practical sub-1B model that is stronger than ultra-tiny models and easier to run than 1.7B/3B.
Questions for r/LocalLLaMA:
Would ~750M be a useful size class, or is it too awkward between 360M and 1.7B?
Would 16K context be realistic/useful at this size?
Would you prefer this kind of model to focus on:
- general chat
- coding
- math/reasoning
- multilingual
- low-RAM CPU inference
- mobile/on-device use
And what benchmarks would actually matter for a model this small?
(Note: this text was generated by GPT-5.5 Thinking. I am a human. Don't say "ai slop". Just respond to the questions.)
Middle_Bullfrog_6173@reddit
For CPU inference a small MoE would be much more useful than a dense model. There are many tiny dense models to choose from and you can always choose a quantized version of a larger model if you need to fit a specific byte size.
Badger-Purple@reddit
What exactly would an MoE do with such a low number of parameters per expert?
Middle_Bullfrog_6173@reddit
Same thing as a dense model but with more total parameters?
Badger-Purple@reddit
Not exactly, the experts are too small to converge on real answers. But maybe one of the Bonsai 8B MoE models at 1-bit would work.
Middle_Bullfrog_6173@reddit
Why would the experts be too small?
Take a toy example: 1B dense with half the parameters in MLPs. Now instead, turn it into a MoE with 1 active expert out of 8, i.e. 8×500M in experts (4.5B total, keeping embeddings and attention the same). Now you have exactly the same capacity in each expert as the dense MLP has.
There are several MoE models around with about 1B active and there are countless dense models with less than that total.
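A back-of-the-envelope version of that toy accounting, using the illustrative numbers above rather than any real architecture:

```python
# Parameter accounting for the toy example: 1B dense vs. an 8-expert MoE
# with 1 active expert, keeping embeddings/attention identical.
embed_attn = 0.5e9   # embeddings + attention, shared by both designs
dense_mlp  = 0.5e9   # dense model's MLP stack (= size of one expert)
n_experts  = 8
active     = 1       # experts routed per token

dense_total = embed_attn + dense_mlp               # 1.0B
moe_total   = embed_attn + n_experts * dense_mlp   # 4.5B stored
moe_active  = embed_attn + active * dense_mlp      # 1.0B used per token

print(f"dense total: {dense_total / 1e9:.1f}B")
print(f"MoE total:   {moe_total / 1e9:.1f}B")
print(f"MoE active:  {moe_active / 1e9:.1f}B (same per-token compute as dense)")
```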
Badger-Purple@reddit
He wants sub-1B total, experts included.
Middle_Bullfrog_6173@reddit
The OP wanted a small dense model. I was arguing that a MoE with equivalent active parameters makes more sense if the goal is CPU inference speed. You get similar speed but a more useful model.
Revolutionalredstone@reddit
Yeah, this is needed badly: a lot of tiny tasks on tiny devices need 10% of qwen3.6's agentic abilities but >5000% of qwen 3.5's ;D
The next wave of well-tuned / harness-adapted fast local coding models is going to be a bloodbath for OpenAI / Claude sub counts.
Only_Play_868@reddit
I'd experiment with it, could easily find value for things like summarization, zero-shot classification, data extraction (if structured output is supported), matching, labeling, etc. At 750M I'd prefer it be mono-lingual (English) and text-only. I'd also be curious how quantization affects performance & size at this scale.
For context, I've been using Apple Intelligence to build on-device apps and it's actually OK. I know it's a 3B model but in my experience, it's much weaker than most 3B models today and its safety guardrails are a pain. It's also Apple Silicon and macOS/iOS 26+ only. I'd love a small fallback for non-Apple Silicon, pre-26 devices. I use a handful of small models to support some of these use cases, but they're far less flexible.
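For a concrete flavor of that single-task use, here is a minimal sketch that prompts an existing SmolLM2 instruct checkpoint as a stand-in (the proposed 750M model doesn't exist yet, and the prompt/labels are made up for illustration):

```python
from transformers import pipeline

# Minimal sketch: single-task classification by prompting a small instruct model.
# Uses an existing SmolLM2 checkpoint as a stand-in; labels and ticket are invented.
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M-Instruct")

prompt = (
    "Classify the following support ticket as one of: billing, bug, feature.\n"
    "Ticket: The app crashes every time I open the settings page.\n"
    "Label:"
)
result = generator(prompt, max_new_tokens=5, do_sample=False)
print(result[0]["generated_text"])
```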
BevinMaster@reddit
How slow is Qwen3.5-0.8B to use?
Silver-Champion-4846@reddit
10 t/s on my CPU. But it does nothing good on my end with Jan; it generates wrong tool-call syntax.
Dany0@reddit
ai slop ai slop
WyattTheSkid@reddit
From my experience, anything below 1B parameters can still be very useful, but it is more suited to doing one specific thing really well rather than being a chat model, like text classification or what the Grammarly models do, for example. A pretrained ~750M model would definitely have potential to be a good grammar correction model or classifier or something, but if by "awkward" you meant awkward for general inference/conversational tasks, then yeah, don’t expect it to go well for that.
fgp121@reddit
This seems like a solid gap to fill - 750M could definitely hit a sweet spot for CPU inference. For benchmarking, I've been using Neo to run evaluation workflows across different model sizes and it handles the testing/iteration pretty smoothly. For a model this small, I'd focus on IFEval, MBPP, and maybe multilingual benchmarks since you're targeting English+Spanish.