Looking for open source 10B model that is comparable to gpt4o-mini
Posted by bohemianLife1@reddit | LocalLLaMA | View on Reddit | 36 comments
Hi All, big fan of this community.
I am looking for a 10B model that is comparable to GPT4o-mini.
The application is simple: it has to be coherent in sentence formation (conversational), i.e. the ability to follow a good system prompt (15k token length).
Good streaming performance (TTFT around 600 ms).
Solid reliability on function calling with up to 15 tools.
Some background:
In my daily testing (I'm a voice agent developer) I have found only one model to date that is useful in voice applications: GPT-4o-mini. Since it came out, no model, open or closed, has come close to it. I was very excited for the LFM model with its amazing state-space efficiency, but I failed to get good system prompt adherence with it.
All new models, again closed or open, are focusing on intelligence (through reasoning) and not on reliability with speed.
If anyone has a proper suggestion, it would help the most.
I am trying to fit a voice agent on a single GPU.
ASR with https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1 (it's amazing takes 1GB of VRAM)
LLM <=== Need help!
TTS with https://github.com/ysharma3501/FastMaya (Maya 1 from maya research)
Hardware: 16GB 5060Ti
exaknight21@reddit
To be honest, I personally feel qwen3:4b-instruct is as good as gpt-4o-mini.
EndlessZone123@reddit
The knowledge doesn't seem comparable.
z_3454_pfk@reddit
use wiki tool and it’ll then have all the knowledge
EndlessZone123@reddit
How are people hooking these models up to wikis or the web? Llamacpp is cool and all, but it doesn't have any additional features to make an LLM actually good.
luncheroo@reddit
The easiest way for me, without tinkering too much, is LM Studio: install MCP servers via the CLI, then update the config in LM Studio to call them. Far from perfect, but you can give a small model web and Wikipedia search pretty quickly that way. I think a model with native tool calling and rolling your own solution is probably superior, but I haven't pursued it.
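For reference, LM Studio's MCP config follows the common `mcpServers` convention; a minimal sketch (the server name and package here are illustrative placeholders, so copy the entry from whatever MCP server you actually install):

```json
{
  "mcpServers": {
    "wikipedia": {
      "command": "npx",
      "args": ["-y", "example-wikipedia-mcp-server"]
    }
  }
}
```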
Adventurous_Cat_1559@reddit
That’s the neat part, you can make your own whichever way you want (it’s basically what all the other things do, use llama.cpp and hack a vibe coded UI with tool calling / rag onto it)
National_Meeting_749@reddit
Vibe coding is still only for people who understand how to code.
I've done a lot of vibe coding, even using Claude pretty heavily, and I cannot make anything that actually works. Let alone anything actually useful, because I don't understand coding.
I could not hack together a working front end that has tool calling and rag built in.
I could spend hours and hours and hours talking to Claude and it would never truly be functional.
iChrist@reddit
You connect llama.cpp to OpenWebUI / any other good LLM frontend. Or use LM Studio, which has everything covered.
Miserable-Dare5090@reddit
https://lmstudio.ai/lmstudio/wikipedia
https://lmstudio.ai/danielsig/duckduckgo
Impossible-Power6989@reddit
I had the "genius" idea of using a web scraper to call Wikipedia and scrape the JSON for any topic directly (as wiki entries used to be accessible like that), then inject that into the chat, figuring it would be fast, neat, and cool / not require much post-process cleanup.
Didn't work, sadly.
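For what it's worth, Wikipedia does expose per-page JSON through its REST summary endpoint, so the basic idea is a rough sketch like this (parsing kept separate from fetching so it can be checked offline; no guarantee this matches whatever broke in my attempt):

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API = "https://en.wikipedia.org/api/rest_v1/page/summary/"

def summary_url(topic: str) -> str:
    # Wikipedia's REST API serves a JSON summary per page title
    return API + quote(topic.replace(" ", "_"))

def extract_text(payload: str) -> str:
    # Pull the plain-text summary out of the JSON response
    return json.loads(payload).get("extract", "")

def fetch_summary(topic: str) -> str:
    # Fetch and parse; the result could then be injected into the chat context
    with urlopen(summary_url(topic)) as resp:
        return extract_text(resp.read().decode("utf-8"))
```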
StardockEngineer@reddit
Llamacpp is the wrong layer; it's for inference. You need a UI that has tools, like MCPs. The UI talks to llamacpp, and the LLM will ask the UI to call tools on its behalf.
If you’re just getting started, use LM Studio instead. It can manage all the parts for you - llamacpp, UI and tools.
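To make the layering concrete, here's a sketch of the OpenAI-style chat request the UI layer would send to llama-server or LM Studio's local endpoint (the tool name and schema are made up for illustration); the model answers with `tool_calls`, and it's the UI's job to execute them and feed results back:

```python
import json

# Minimal OpenAI-style chat request carrying one tool schema. The serving
# layer just relays it; executing any returned `tool_calls` is up to the UI.
request = {
    "model": "local-model",  # placeholder; local servers often ignore this
    "messages": [
        {"role": "user", "content": "What does the wiki say about llamas?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "wiki_search",  # hypothetical tool the UI would implement
            "description": "Search Wikipedia and return a short summary.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }],
}

print(json.dumps(request, indent=2)[:80])
```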
h3wro@reddit
Not sure about others, but I would do RAG (Retrieval Augmented Generation) with embedding model to fetch data based on embedded query.
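A toy sketch of that retrieval step, using a bag-of-words counter as a stand-in for a real embedding model (in practice you'd use a sentence-embedding model and a vector store; this only illustrates the rank-by-similarity idea):

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real RAG setup would embed with a model
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse token-count vectors
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Rank documents by similarity to the embedded query, return top k
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Paris is the capital of France.",
    "The mitochondria is the powerhouse of the cell.",
]
print(retrieve("what is the capital of France", docs))
```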
StardockEngineer@reddit
RAG is so terribly unreliable tho. And your RAG DB isn’t going to have world knowledge.
dkeiz@reddit
and then there's even qwen3:8b-instruct!
bohemianLife1@reddit (OP)
Qwen is a great series, but 4B will be a little inadequate for the task.
exaknight21@reddit
My use case was to analyze and generate contracts. It performed better than my usual gpt-4o-mini chats.
dash_bro@reddit
15 tools, 32-48k context 100% recall, under 10B params????
Sorry, nothing matches these specs at the level of gpt-4o-mini. 10B is ridiculously low for saturating anything near gpt-4o-mini performance across the board.
You'll need to upgrade to at least the 30-50B range (qwen3-30B-A3B, qwen3-14B, kimi-linear-48B-A3B, gemma3-12B, glm-4-32B, seed-oss-36B) to see comparable results.
GLM in particular has been very good at tool calling for me. Personally, I'd pick either GLM, Qwen3-30B-A3B or kimi-linear-48B-A3B
Go to OpenRouter and, for the same tasks that gpt-4o-mini does right now, try swapping between the above to see which one works best for you.
Once you know, you can download it locally and run it.
bohemianLife1@reddit (OP)
I definitely agree. Personally, I have been looking for this for a while and wanted to give up the pursuit after this Reddit post. I posted in the hope that someone had a different perspective.
Also, comparing GPT-4o-mini's performance with this non-existent 10B model was on the basis that GPT-4o-mini came out on July 18, 2024. It's been 1.5 years; we as an AI community have surely pushed the boundaries for bigger, smarter models, but there is less focus on bringing the SOTA gains into smaller models.
Lastly, noted; I'll surely give your models a try.
I have been downloading models locally like a dummy; thanks for the OpenRouter tip.
Miserable-Dare5090@reddit
4o-mini is a 30-50B model. You are asking for something free and better than 4o-mini, with fewer parameters and without an agentic harness like OpenAI puts on their models.
dash_bro@reddit
Technically speaking, benchmark-wise, yes, qwen3-14B dense should be close to gpt-4o-mini.
However, it's more about the breadth and quality of the data 4o was trained on and then distilled from, which makes 4o-mini so fantastic at its size. In other words, it's great because its breadth of knowledge comes from 4o.
It's honestly one of the most technically impressive models as far as I'm concerned.
Miserable-Dare5090@reddit
https://x.com/rohanpaul_ai/status/1994509375470465039?s=46
Plenty of small finetuned models with MCP tools will achieve comparable knowledge depth, or the ability to retrieve that much knowledge depth.
Qwen3 4B and 8B and their finetunes are natively able to process 260k context.
robonxt@reddit
I think you are vastly underestimating how much ram or VRAM is needed to run any proprietary models.
There's a reason why people spend thousands of dollars on their systems, or just use a cloud provider to run their choice of llm model.
That being said, you may want to try out:
- Granite 4 series (check out the 4H-tiny; I got 32k context working under 8GB of VRAM)
- Llama 3 series (the small 8B ones; some swear by the original, some 3.1, some 3.2, and rarely 3.3)
- Qwen2/Qwen3 series (the smaller models)
Otherwise, good luck, because you might want to upgrade in the near future, or compromise somewhere
Mbcat4@reddit
?? With 8GB VRAM and 32GB RAM I run Qwen3 VL 30B; there's no need for small models.
bohemianLife1@reddit (OP)
Tried granite 4H-tiny (Q8 with llama.cpp server 8k context).
It adheres to the prompt almost perfectly. I'll try tool calling.
Fox-Lopsided@reddit
Qwen Omni ?
thebadslime@reddit
What about an MoE? ERNIE 4.5 21BA3B is on par with gpt4 regular.
bohemianLife1@reddit (OP)
Will it fit on 16GB card?
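A rough weight-only estimate suggests it should be close (the 10% overhead factor here is my own assumption, and KV cache plus activations come on top, so treat this as a loose lower bound):

```python
def quantized_size_gb(params_b: float, bits_per_weight: float,
                      overhead: float = 1.1) -> float:
    # Weight-only size estimate: params * bits / 8, plus ~10% for
    # embeddings/metadata. KV cache and activations are NOT included.
    return params_b * 1e9 * bits_per_weight / 8 / 1e9 * overhead

# ERNIE 4.5 21B at roughly Q4 (~4.5 bits/weight): about 13 GB of weights,
# so a 16GB card is tight once context is added.
print(round(quantized_size_gb(21, 4.5), 1))
```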
Nixellion@reddit
In my experience with agentic tasks in my own assistants - 14B models work well in 16GB. Qwen3 or Gemini 3. Both are great.
Gemini 4B is also quite powerful, I used it as Deep Research agent, it was capable of properly calling web search and web open tools I provided to gather data and summarizing it for the larger model to reason about.
reginakinhi@reddit
For anyone reading this, by Gemini they presumably mean Google's open source offering: Gemma here.
Nixellion@reddit
Oops, yes, thanks for the correction.
thebadslime@reddit
MoE also uses system ram, i get 25 tps with a 4gb card and 32gb ram.
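The usual llama.cpp trick for that is keeping attention layers on the GPU while pushing the MoE expert tensors to system RAM with a tensor override; a sketch (the model path is a placeholder, and the exact `-ot` regex depends on the tensor names in your GGUF):

```shell
llama-server -m ./Qwen3-30B-A3B-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  -ot ".ffn_.*_exps.=CPU" \
  -c 16384
```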
Lucky-Necessary-8382@reddit
There should be some finetunes of qwen3 trained on 4o outputs.
pmttyji@reddit
Though they're not 10B, here are some models for your 16GB VRAM.
With additional RAM, you could go for higher quants & also go with Qwen3-30B MOE models. I run Qwen3-30B(Q4) with just 8GB VRAM + 32GB RAM.
lly0571@reddit
Gemma3-12B-it might be okay, but may fall short compared with 4o-mini.
You can get maybe ~40 t/s on Qwen3-30B-A3B with llama.cpp, with 8-10GB of VRAM remaining.
beppled@reddit
Ok, so you mentioned you need an agent... it depends on how tool-heavy and image-understanding-heavy your workflow is... I'd recommend Jan for MCP use.
Unless you want something NSFW, factory-tuned models like Gemma 3n E4B would actually work perfectly... Qwen has a long CoT problem, not snappy enough.
I've never gotten it to work, but this Gemma model accepts audio and video natively too; you can try your luck...
bohemianLife1@reddit (OP)
I thought of the Gemma models; I'll give 4B a try, but I assume it will fall a little short for the work. I tested 27B and it is amazing, but it doesn't fit the hardware requirements.