Looking for open source 10B model that is comparable to gpt4o-mini
Posted by bohemianLife1@reddit | LocalLLaMA | View on Reddit | 36 comments
Hi All, big fan of this community.
I am looking for a 10B model that is comparable to GPT4o-mini.
The application is simple: it has to be coherent in sentence formation (conversational), i.e. the ability to follow a good system prompt (15k token length).
Good streaming performance (TTFT around 600 ms).
Solid reliability on function calling with up to 15 tools.
Some background:
In my daily testing (I'm a voice agent developer) I have found only one model to date that is useful in voice applications: GPT-4o-mini. Since it came out, no model, open or closed, has come close to it. I was very excited for the LFM model with its amazing state-space efficiency, but I failed to get good system prompt adherence with it.
All new models, again closed or open, are focusing on intelligence (through reasoning) and not on reliability with speed.
If anyone has a proper suggestion, it would help the most.
I am trying to fit a voice agent on a single GPU.
ASR with https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1 (it's amazing takes 1GB of VRAM)
LLM <=== Need help!
TTS with https://github.com/ysharma3501/FastMaya (Maya 1 from maya research)
Hardware: 16GB 5060Ti
exaknight21@reddit
To be honest, I personally feel qwen3:4b-instruct is as good as gpt-4o-mini.
EndlessZone123@reddit
The knowledge doesn't seem comparable.
z_3454_pfk@reddit
use wiki tool and it’ll then have all the knowledge
EndlessZone123@reddit
How are people hooking these models up to wikis or the web? Llamacpp is cool and all, but it doesn't have any additional features to make an LLM actually good.
luncheroo@reddit
The easiest way for me, without tinkering too much, is LM Studio: install MCP servers via the CLI, then update the config in LM Studio to call them. Far from perfect, but you can give a small model web and Wikipedia search pretty quickly that way. I think a model with native tool calling and rolling your own solution is probably superior, but I haven't pursued it.
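For reference, LM Studio's MCP config follows the common `mcpServers` convention; a minimal sketch (the server name and package here are illustrative placeholders, so copy the entry from whatever MCP server you actually install):

```json
{
  "mcpServers": {
    "wikipedia": {
      "command": "npx",
      "args": ["-y", "example-wikipedia-mcp-server"]
    }
  }
}
```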
Adventurous_Cat_1559@reddit
That’s the neat part, you can make your own whichever way you want (it’s basically what all the other things do, use llama.cpp and hack a vibe coded UI with tool calling / rag onto it)
National_Meeting_749@reddit
Vibe coding is still only for people who understand how to code.
I've done a lot of vibe coding, even using Claude pretty heavily, and I cannot make anything that actually works. Let alone anything actually useful, because I don't understand coding.
I could not hack together a working front end that has tool calling and rag built in.
I could spend hours and hours and hours talking to Claude and it would never truly be functional.
iChrist@reddit
You connect llama.cpp to OpenWebUI / any other good LLM frontend. Or use LM Studio, which has everything covered.
Miserable-Dare5090@reddit
https://lmstudio.ai/lmstudio/wikipedia
https://lmstudio.ai/danielsig/duckduckgo
Impossible-Power6989@reddit
I had the "genius" idea of using a web scraper to call Wikipedia and scrape the JSON for any topic directly (as wiki entries used to be accessible like that), then inject that into the chat, figuring it would be fast, neat, and cool / not require much post-process cleanup.
Didn't work, sadly.
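For what it's worth, Wikipedia does expose per-page JSON through its REST summary endpoint, so the basic idea is a rough sketch like this (parsing kept separate from fetching so it can be checked offline; no guarantee this matches whatever broke in my attempt):

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API = "https://en.wikipedia.org/api/rest_v1/page/summary/"

def summary_url(topic: str) -> str:
    # Wikipedia's REST API serves a JSON summary per page title
    return API + quote(topic.replace(" ", "_"))

def extract_text(payload: str) -> str:
    # Pull the plain-text summary out of the JSON response
    return json.loads(payload).get("extract", "")

def fetch_summary(topic: str) -> str:
    # Fetch and parse; the result could then be injected into the chat context
    with urlopen(summary_url(topic)) as resp:
        return extract_text(resp.read().decode("utf-8"))
```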
StardockEngineer@reddit
Llamacpp is the wrong layer; it's for inference. You need a UI that has tools, like MCPs. The UI talks to llamacpp, and the LLM will ask the UI to call tools on its behalf.
If you’re just getting started, use LM Studio instead. It can manage all the parts for you - llamacpp, UI and tools.
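To make the layering concrete, here's a sketch of the OpenAI-style chat request the UI layer would send to llama-server or LM Studio's local endpoint (the tool name and schema are made up for illustration); the model answers with `tool_calls`, and it's the UI's job to execute them and feed results back:

```python
import json

# Minimal OpenAI-style chat request carrying one tool schema. The serving
# layer just relays it; executing any returned `tool_calls` is up to the UI.
request = {
    "model": "local-model",  # placeholder; local servers often ignore this
    "messages": [
        {"role": "user", "content": "What does the wiki say about llamas?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "wiki_search",  # hypothetical tool the UI would implement
            "description": "Search Wikipedia and return a short summary.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }],
}

print(json.dumps(request, indent=2)[:80])
```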
h3wro@reddit
Not sure about others, but I would do RAG (Retrieval Augmented Generation) with embedding model to fetch data based on embedded query.
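A toy sketch of that retrieval step, using a bag-of-words counter as a stand-in for a real embedding model (in practice you'd use a sentence-embedding model and a vector store; this only illustrates the rank-by-similarity idea):

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real RAG setup would embed with a model
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse token-count vectors
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Rank documents by similarity to the embedded query, return top k
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Paris is the capital of France.",
    "The mitochondria is the powerhouse of the cell.",
]
print(retrieve("what is the capital of France", docs))
```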
StardockEngineer@reddit
RAG is so terribly unreliable tho. And your RAG DB isn’t going to have world knowledge.
dkeiz@reddit
and then there's even qwen3:8b-instruct!
bohemianLife1@reddit (OP)
Qwen is a great series, but 4B will be a little inadequate for the task.
exaknight21@reddit
My use case was to analyze and generate contracts. It performed better than my usual gpt-4o-mini chats.
dash_bro@reddit
15 tools, 32-48k context 100% recall, under 10B params????
Sorry, nothing matches these specs at the level of gpt-4o-mini. 10B is ridiculously low for saturating anything near gpt-4o-mini performance across the board.
You'll need to upgrade to at least the 30-50B range (qwen3-30B-A3B, qwen3-14B, kimi-linear-48B-A3B, gemma3-12B, glm-4-32B, seed-oss-36B) to see comparable results.
GLM in particular has been very good at tool calling for me. Personally, I'd pick either GLM, Qwen3-30B-A3B or kimi-linear-48B-A3B
Go to OpenRouter and, for the same tasks that gpt-4o-mini does right now, try swapping between the above to see which one works best for you.
Once you know, you can download it locally and run it.
bohemianLife1@reddit (OP)
I definitely agree. Personally, I have been looking for this for a while and wanted to give up the pursuit after this Reddit post. I posted in the hope that someone had a different perspective.
Also, comparing GPT-4o-mini's performance with this non-existent 10B model was on the basis that GPT-4o-mini came out on July 18, 2024. It's been 1.5 years; we as an AI community have surely pushed the boundaries for bigger, smarter models, but there is less focus on bringing the SOTA gains into smaller models.
Lastly, noted; I'll surely give your models a try.
I have been downloading models locally like a dummy; thanks for the OpenRouter tip.
Miserable-Dare5090@reddit
4o-mini is a 30-50B model. You are asking for something free and better than 4o-mini, with fewer parameters and without an agentic harness like OpenAI puts on their models.
dash_bro@reddit
Technically speaking, benchmark-wise, yes, qwen3-14B dense should be close to gpt-4o-mini.
However, it's more about the breadth and quality of the data 4o was trained on and then distilled from, which makes 4o-mini so fantastic at its size. In other words, it's great because its breadth of knowledge comes from 4o.
It's honestly one of the most technically impressive models as far as I'm concerned.
Miserable-Dare5090@reddit
https://x.com/rohanpaul_ai/status/1994509375470465039?s=46
Plenty of small finetuned models with MCP tools will achieve comparable knowledge depth, or the ability to retrieve that much knowledge depth.
Qwen3 4B and 8B and their finetunes are natively able to process 260k context.
robonxt@reddit
I think you are vastly underestimating how much ram or VRAM is needed to run any proprietary models.
There's a reason why people spend thousands of dollars on their systems, or just use a cloud provider to run their choice of llm model.
That being said, you may want to try out:
- Granite 4 series (check out the 4H-tiny; I got 32k context working under 8GB of VRAM)
- Llama 3 series (the small 8B ones; some swear by the original, some 3.1, some 3.2, and rarely 3.3)
- Qwen2/Qwen3 series (the smaller models)
Otherwise, good luck, because you might want to upgrade in the near future, or compromise somewhere
Mbcat4@reddit
?? With 8GB VRAM and 32GB RAM I run Qwen3 VL 30B; there's no need for small models.
bohemianLife1@reddit (OP)
Tried granite 4H-tiny (Q8 with llama.cpp server 8k context).
It adheres to the prompt almost perfectly. I'll try tool calling.
Fox-Lopsided@reddit
Qwen Omni ?
thebadslime@reddit
What about an MoE? ERNIE 4.5 21BA3B is on par with gpt4 regular.
bohemianLife1@reddit (OP)
Will it fit on 16GB card?
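A rough weight-only estimate suggests it should be close (the 10% overhead factor here is my own assumption, and KV cache plus activations come on top, so treat this as a loose lower bound):

```python
def quantized_size_gb(params_b: float, bits_per_weight: float,
                      overhead: float = 1.1) -> float:
    # Weight-only size estimate: params * bits / 8, plus ~10% for
    # embeddings/metadata. KV cache and activations are NOT included.
    return params_b * 1e9 * bits_per_weight / 8 / 1e9 * overhead

# ERNIE 4.5 21B at roughly Q4 (~4.5 bits/weight): about 13 GB of weights,
# so a 16GB card is tight once context is added.
print(round(quantized_size_gb(21, 4.5), 1))
```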
Nixellion@reddit
In my experience with agentic tasks in my own assistants - 14B models work well in 16GB. Qwen3 or Gemini 3. Both are great.
Gemini 4B is also quite powerful, I used it as Deep Research agent, it was capable of properly calling web search and web open tools I provided to gather data and summarizing it for the larger model to reason about.
reginakinhi@reddit
For anyone reading this, by Gemini they presumably mean Google's open source offering: Gemma here.
Nixellion@reddit
Oops, yes, thanks for the correction.
thebadslime@reddit
MoE also uses system ram, i get 25 tps with a 4gb card and 32gb ram.
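The usual llama.cpp trick for that is keeping attention layers on the GPU while pushing the MoE expert tensors to system RAM with a tensor override; a sketch (the model path is a placeholder, and the exact `-ot` regex depends on the tensor names in your GGUF):

```shell
llama-server -m ./Qwen3-30B-A3B-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 16384
```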
Lucky-Necessary-8382@reddit
There should be some finetunes of qwen3 trained on 4o outputs.
pmttyji@reddit
Though they're not 10B, here are some models for your 16GB VRAM.
With additional RAM, you could go for higher quants & also go with Qwen3-30B MOE models. I run Qwen3-30B(Q4) with just 8GB VRAM + 32GB RAM.
lly0571@reddit
Gemma3-12B-it might be okay, but may fall short compared with 4o-mini.
You can get maybe ~40 t/s on Qwen3-30B-A3B with llama.cpp, with 8-10GB of VRAM remaining.
beppled@reddit
Ok, so you mentioned you need an agent... it depends on how tool-heavy and image-understanding-heavy your workflow is... I'd recommend Jan for MCP use.
Unless you want something NSFW, factory-tuned models like Gemma 3n E4B would actually work perfectly... Qwen has a long CoT problem, not snappy enough.
I've never gotten it to work, but this Gemma model accepts audio and video natively too; you can try your luck...
bohemianLife1@reddit (OP)
I thought of the Gemma models; I'll give 4B a try, but I assume it will fall a little short for the work. I tested 27B and it is amazing, but it doesn't fit the hardware requirements.