Would a fully open SmolLM4-750M with 16K context make sense?
Posted by Ok-Type-7663@reddit | LocalLLaMA | 14 comments
I’ve been thinking about a possible gap in the current small local model space: a modern, fully open ~750M model.
Hugging Face already has SmolLM2 at 135M, 360M, and 1.7B, and SmolLM3 pushes the family to 3B with long context, multilingual support, and reasoning. The Smol Models repo also describes the goal pretty clearly: fully open, compact models that can run effectively on-device while still having strong performance.
So my idea is:
SmolLM4-750M
High-level target:
- ~750M parameters
- 16K context
- Causal LM
- Fully open weights
- Fully open data recipe
- Training/eval details public
- Apache-2.0 if possible
- Main languages: English + Spanish
- Built for local inference, weak hardware, students, hobbyists, and small-device experiments
I’m intentionally not suggesting exact architecture internals like layer count, FFN size, attention heads, RoPE settings, etc. Hugging Face would know better how to design that. I’m more interested in whether the size class itself makes sense.
Why 750M?
To me, it feels like a missing middle point:
- 135M / 360M are cool but often too limited
- 1.7B is much better but heavier
- 3B is already a different class for weak machines
- ~750M could be a sweet spot for low-RAM CPU inference, fast testing, small fine-tunes, education, and “actually usable but still tiny” local workflows
Possible dataset direction (a rough loading sketch follows this list):
- HuggingFaceTB/smollm-corpus
- HuggingFaceFW/fineweb-edu
- HuggingFaceTB/finemath
- HuggingFaceTB/stack-edu
- HuggingFaceTB/smoltalk2
- HuggingFaceTB/cosmopedia
- HuggingFaceFW/fineweb-2, Spanish subset spa_Latn
- open-thoughts/OpenThoughts-114k
- HuggingFaceTB/smol-smoltalk
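As a very rough illustration (not a vetted recipe), a streaming English+Spanish mix over a few of these datasets could be assembled with the `datasets` library along these lines; the subset names and mixing weights here are placeholder assumptions, not tested values:

```python
from datasets import load_dataset, interleave_datasets

# Sketch of a weighted EN/ES pre-training stream. Config names ("sample-10BT",
# "finemath-4plus", "spa_Latn") and the probabilities are illustrative choices,
# not a validated SmolLM-style data recipe.
en_edu = load_dataset("HuggingFaceFW/fineweb-edu", "sample-10BT",
                      split="train", streaming=True)
es_web = load_dataset("HuggingFaceFW/fineweb-2", "spa_Latn",
                      split="train", streaming=True)
math = load_dataset("HuggingFaceTB/finemath", "finemath-4plus",
                    split="train", streaming=True)

mix = interleave_datasets(
    [en_edu, es_web, math],
    probabilities=[0.7, 0.2, 0.1],  # mostly English edu web, plus Spanish and math
    seed=42,
)

# Peek at a few interleaved documents.
for i, example in enumerate(mix):
    print(example["text"][:120])
    if i == 2:
        break
```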
The goal would not be to beat 3B models. The goal would be a very clean, open, practical sub-1B model that is stronger than ultra-tiny models and easier to run than 1.7B/3B.
Questions for r/LocalLLaMA:
Would ~750M be a useful size class, or is it too awkward between 360M and 1.7B?
Would 16K context be realistic/useful at this size?
Would you prefer this kind of model to focus on:
- general chat
- coding
- math/reasoning
- multilingual
- low-RAM CPU inference
- mobile/on-device use
And what benchmarks would actually matter for a model this small?
(Note: this text was generated by GPT-5.5 Thinking. I am a human. Don't say "ai slop". Just respond to the questions.)
Middle_Bullfrog_6173@reddit
For CPU inference a small MoE would be much more useful than a dense model. There are many tiny dense models to choose from and you can always choose a quantized version of a larger model if you need to fit a specific byte size.
Badger-Purple@reddit
What exactly would an MoE do with such a low number of parameters per expert?
Middle_Bullfrog_6173@reddit
Same thing as a dense model but with more total parameters?
Badger-Purple@reddit
Not exactly, the experts are too small to converge on real answers. But maybe one of the Bonsai 8B MoE models at 1-bit would work.
Middle_Bullfrog_6173@reddit
Why would the experts be too small?
Take a toy example: 1B dense with half the parameters in MLPs. Now instead, turn it into a MoE with 1 active expert out of 8, i.e. 8×500M in experts (4.5B total, keeping embeddings and attention the same). Now you have exactly the same capacity in each expert as the dense MLP has.
There are several MoE models around with about 1B active and there are countless dense models with less than that total.
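A back-of-the-envelope version of that toy accounting, using the illustrative numbers above rather than any real architecture:

```python
# Parameter accounting for the toy example: 1B dense vs. an 8-expert MoE
# with 1 active expert, keeping embeddings/attention identical.
embed_attn = 0.5e9   # embeddings + attention, shared by both designs
dense_mlp  = 0.5e9   # dense model's MLP stack (= size of one expert)
n_experts  = 8
active     = 1       # experts routed per token

dense_total = embed_attn + dense_mlp               # 1.0B
moe_total   = embed_attn + n_experts * dense_mlp   # 4.5B stored
moe_active  = embed_attn + active * dense_mlp      # 1.0B used per token

print(f"dense total: {dense_total / 1e9:.1f}B")
print(f"MoE total:   {moe_total / 1e9:.1f}B")
print(f"MoE active:  {moe_active / 1e9:.1f}B (same per-token compute as dense)")
```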
Badger-Purple@reddit
He wants sub-1B total, experts included.
Middle_Bullfrog_6173@reddit
The OP wanted a small dense model. I was arguing that a MoE with equivalent active parameters makes more sense if the goal is CPU inference speed. You get similar speed but a more useful model.
Revolutionalredstone@reddit
Yeah, this is needed badly: a lot of tiny tasks on tiny devices need 10% of qwen3.6's agentic abilities but >5000% of qwen 3.5's ;D
The next wave of well-tuned / harness-adapted fast local coding models is going to be a bloodbath for OpenAI / Claude sub counts.
Only_Play_868@reddit
I'd experiment with it, could easily find value for things like summarization, zero-shot classification, data extraction (if structured output is supported), matching, labeling, etc. At 750M I'd prefer it be mono-lingual (English) and text-only. I'd also be curious how quantization affects performance & size at this scale.
For context, I've been using Apple Intelligence to build on-device apps and it's actually OK. I know it's a 3B model but in my experience, it's much weaker than most 3B models today and its safety guardrails are a pain. It's also Apple Silicon and macOS/iOS 26+ only. I'd love a small fallback for non-Apple Silicon, pre-26 devices. I use a handful of small models to support some of these use cases, but they're far less flexible.
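For a concrete flavor of that single-task use, here is a minimal sketch that prompts an existing SmolLM2 instruct checkpoint as a stand-in (the proposed 750M model doesn't exist yet, and the prompt/labels are made up for illustration):

```python
from transformers import pipeline

# Minimal sketch: single-task classification by prompting a small instruct model.
# Uses an existing SmolLM2 checkpoint as a stand-in; labels and ticket are invented.
generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-360M-Instruct")

prompt = (
    "Classify the following support ticket as one of: billing, bug, feature.\n"
    "Ticket: The app crashes every time I open the settings page.\n"
    "Label:"
)
result = generator(prompt, max_new_tokens=5, do_sample=False)
print(result[0]["generated_text"])
```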
BevinMaster@reddit
How slow is Qwen3.5-0.8B to use?
Silver-Champion-4846@reddit
10 t/s on my CPU. But it does nothing good on my end with Jan; it generates wrong tool-call syntax.
Dany0@reddit
ai slop ai slop
WyattTheSkid@reddit
From my experience, anything below 1B parameters can still be very useful, but it is more suited to doing one specific thing really well rather than being a chat model, like text classification or what the Grammarly models do, for example. A pretrained ~750M model would definitely have potential to be a good grammar correction model or classifier or something, but if by "awkward" you meant awkward for general inference/conversational tasks, then yeah, don’t expect it to go well for that.
fgp121@reddit
This seems like a solid gap to fill - 750M could definitely hit a sweet spot for CPU inference. For benchmarking, I've been using Neo to run evaluation workflows across different model sizes and it handles the testing/iteration pretty smoothly. For a model this small, I'd focus on IFEval, MBPP, and maybe multilingual benchmarks since you're targeting English+Spanish.