Best small model right now (~4B params) that is good with agentic tasks for personal assistant?
Posted by BitGreen1270@reddit | LocalLLaMA | View on Reddit | 49 comments
Looking for suggestions. I have been experimenting with gemma-4-E2B and gemma-4-E4B but the tool calling has been not the best? My tasks are just things like:
- Update calendar
- Get my schedule
- Send a WA message at 4PM
etc.
Any suggestions? If it helps, here are my server params:
./llama-server \
--host 0.0.0.0 \
--port 8080 \
-m ~/myp/models/google_gemma-4-E4B-it-Q8_0.gguf \
--temp 1.0 \
--top_p 0.95 \
--top_k 64 \
-c 65536 \
--flash-attn on \
-t 16 \
--ctx-checkpoints 4 \
--cache-ram 16384 \
--chat-template-file /home/lenny/myp/models/jinja/gemma4-improved.jinja \
-ngl 99
Spare-Leadership-895@reddit
qwen 3.5 4b is the first one i'd try.
for this kind of thing i'd score it on a tiny eval set instead of general chat quality: calendar edit, ambiguous time, missing contact, "don't send yet". if the tool args are wrong, the model choice won't save you. I'm building qordinate around this kind of assistant load, and the useful version so far is a thin model plus deterministic checks before anything fires.
Wide_Big_6969@reddit
I think qwen 3.5 4b is still the best
synw_@reddit
Yes. I hope we'll get a refresh of this one at some point: it's the most reliable 4b for tool calls
AnticitizenPrime@reddit
I wonder if this is a situation in which fine tuning on your tool frameworks would be beneficial.
BitGreen1270@reddit (OP)
I wouldn't know where to begin. Especially for tool calls.
sahanpk@reddit
for 4B i’d keep the model out of final execution: let it choose from a tiny intent schema, then deterministic code validates args/time/contact before tools run.
BitGreen1270@reddit (OP)
Yea - I've put in safeguards and requirements as much as I can - but it sometimes misses tool calls while telling me it has already called the tool.
Not_your_guy_buddy42@reddit
No no you build a harness. Don't go "here's tools and info and just do everything please." Small model will mostly just get horrified. Like you're asking a horse to do math.
You write as much as logic you can via a code harness (symbolic). LLM (neuro) just to do atomic tasks within that. go neuro-symbolic bro
BitGreen1270@reddit (OP)
That's what I tried to do. The tools are very narrow and focused and with lots of safeguards in the code itself. The llms job is to translate natural language into the specific tool call.
Not_your_guy_buddy42@reddit
Not what I said
BitGreen1270@reddit (OP)
Sorry - re-read your post. You meant to keep the LLM involvement as minimal as possible. Most of the work should be done by code and the LLM should only be called as backup if the regular non-llm code (or ML model) couldn't get the job done. That makes sense and I might end up going that route eventually. But I'm still learning about LLMs and want to see how far I can push them.
Not_your_guy_buddy42@reddit
That's it almost. But a hybrid where each part does what it's actually good at.
Step 1. -> Check user message for blatant keywords -> fast path routes
Step 2. -> Failed step 1, now ask LLM what tool user wanted. JUST route it
Step 3. -> Different paths per selected tool.
Step 4. -> Ask LLM (with different prompts per path!) what info needs to go into the tool call
Step 5 -> actually parse that shit
This is not even "tool calls" and I have this working since beginning of 2025 with really dumb models, before the word "harness"
BitGreen1270@reddit (OP)
Thanks for sharing. But this restricts it to very specific tool flows. How do you do the part of brainstorming or chatting with the LLM?
Not_your_guy_buddy42@reddit
You brainstorm those tasks?
BitGreen1270@reddit (OP)
Not for those tasks specifically, but if I ask it to do some web research.
MaruluVR@reddit
You can build something like this using "guidance AI" basically its a way to force your model to output multiple choice instead of the token it wants to. That way the ai can only choose between the things your code expects. I have used this before for tests in ai NPCs and it works really well but you want to add a multiple options for none of these apply, or no tool etc
https://github.com/guidance-ai/guidance
CommonPurpose1969@reddit
Tool calling performance for Gemma 4 models (E2B and E4B) has been discussed here, and the consensus was that it was worse than Qwen 3.5 (4B and 2B). Gemma 4 on the other hand is way more creative than Qwen 3.5
FoxiPanda@reddit
I don't think this exists yet but I'd love to be wrong. Tool calling reliability seems to require more parameters in my experience so far, but the only thing I can think of that might be interesting to try would be LFM2.5-8B-A1B ( https://huggingface.co/LiquidAI/LFM2.5-8B-A1B ) or maybe Qwen3.5-4B or 9B.
Jipok_@reddit
Be careful with the LFM2.5-8B-A1B. It's a complete waste of time. I don't know how they got their benchmark numbers. I tried the Q8 model, and it can't even produce the correct tiik call format specified in the template. I was only able to get it to work by manually generating a prompt and parsing the model's response.
BitGreen1270@reddit (OP)
Yea I played a bit with it just now and I'm leaning towards agreeing with your views.
BitGreen1270@reddit (OP)
Of course I have a fallback on Qwen3.6-35B-A3B or Gemma4-26B-A4B, but those run slower on my laptop. Thanks for sharing that - I'll give it a try. The model card emphasizes tool calling. Also, I think it only supports 9 languages or something, which is fine I guess as long as it does reliable tool calling.
Jipok_@reddit
> gemma-4-E4B but the tool calling has been not the best
Well, it doesn't go crazy for me and call tools normally. Try unsloth_gemma-4-E4B-it-UD-Q8_K_XL.
There may be a problem with your "improved" template.
BitGreen1270@reddit (OP)
Thanks are you using the standard template? The improved one worked really well with 31B in pi so I kept it.
Jipok_@reddit
If you give me your specific request and a description of the tools, I can try it with my prompt generator. The problem might actually be with how the template works. You can find me on Telegram; same username, but without the visible space at the end.
Jipok_@reddit
To be honest, I didn't use the E4B version for very long and switched to
unsloth_gemma-4-A4B-it-UD-Q8_K_XL.gguf.For some reason, E4B gives me about half the tokens per second compared to A4B. Might be a llama.cpp quirk, but I didn't dig too deep into it.Even on that model, I still ran into tool-calling issues when using the default/stock template. That's exactly why I manually construct the raw prompts and bypass the templates entirely via the
/v1/completionsendpoint.pajuch@reddit
What agent are you using? Some are super bloated so could overwhelm the small models
BitGreen1270@reddit (OP)
Made one from scratch so it's hyper focused on my requirements. But the constraints are bloating up the prompt so maybe that's why it's getting less reliable.
pajuch@reddit
Nice, a basic Hermes config is like 15k tokens first call so probably not worse than that. I think you should be able to get it to work with those simple tasks to be honest. In addition to what others are saying try decreasing temp a lot and top k a bit
BitGreen1270@reddit (OP)
Mine is nowhere near 15k lol. Good to know that's the reference. Mine is about 1k at best.
Secret_Theme3192@reddit
For personal-assistant tasks, I’d test the model less on chat quality and more on boring tool-call reliability. Make a small eval set with calendar edits, ambiguous times, missing fields, and “don’t send yet” cases. A 4B model can feel fine in conversation but still be risky if it calls the right tool with the wrong arguments. I’d also keep destructive actions behind confirmation no matter which model you pick.
BitGreen1270@reddit (OP)
This is a good idea. I'll look into setting up an eval for this. Thank you 👍
GrungeWerX@reddit
Try Qwen 3.5 4B
BitGreen1270@reddit (OP)
Thank you - someone else also recommended this, I'll try it out.
ReferenceOwn287@reddit
I’ve had good success for such tool calls with Qwen 3.5 2B model. However, you’d need an embedding model and exemplars for reliable tool calls. Directly asking the model for tool calls will be extremely unreliable. Also use the model’s inbuilt tool schema and don’t invent your own.
BitGreen1270@reddit (OP)
Thanks can you elaborate more on the embedding model and how to use it? Also how do you mean the models tool schema? Isn't it already baked into llama.cpp?
keen23331@reddit
have you tried; https://huggingface.co/prism-ml/Ternary-Bonsai-8B-gguf
it's an 8B model but its a 1.58/2 bit model and thus requires mutch less VRAM or RAM and is mutch faster than even 4B Models usually. the catch you need to compile llama.cpp by yourself since it's experimental and not yet in the main tree from llama.cpp. git tree: https://github.com/PrismML-Eng/llama.cpp
BitGreen1270@reddit (OP)
Thank you! Will try this out as well. Model card doesn't say much about tool calling though.
belsamber@reddit
I’m surprised there’s not much mention of Qwen 3.5 4B. I have it set up with a few tools, one with a small query DSL, and it’s working better than I thought it would. It’s not opus by any means, but it’s functional.
BitGreen1270@reddit (OP)
Oh I didn't know about this - I'll take it for a spin, thank you!
arnav080@reddit
ive made an open-s tool to make sharing and running these optimised recipes like these easier and instant [bloc-theta.vercel.app]
Far-Cookie2275@reddit
Lowest one with tool calling I can think of is qwen3-8b not sure how well it will work or how capable it is but tested it on tidying up directory and it worked
BitGreen1270@reddit (OP)
This was released a year ago - that's decades in AI world 😞
Wrong_Mushroom_7350@reddit
Ok I am currently testing gemma 4 e4b it quant 8. I have not really ran through the paces, I did try a one shot attempt to re create a flappy bird clone, even with the code provided, and i did not have any luck.. But, I just installed it and tried it.. I need more time to work with it.. I admit the prompt was not that great and I have a locked down sandbox, so it errored a lot. I have 131k context for gemma 4.
I have a 4080 super and my main ran qwen 3.6 35b a3b mtp.. but i only had 16k context, and I needed something more.
BitGreen1270@reddit (OP)
I think coding might be a tall order for the E4B. I'm just trying simple tool calls which are not even multi-step.
Wrong_Mushroom_7350@reddit
I think you might be correct, but I have to try. Each time I try a new model I learn something new, and improve on the process.
BitGreen1270@reddit (OP)
I have a 5090 and I run out of memory as well. I recently found out about --kv-offload which pushes the kv cache to system RAM. It slows down the tps but greatly improves reliability and you can go higher context with it. Also with --ctk and --ctv at q8_0 and q4_0 respectively.
Wrong_Mushroom_7350@reddit
Yeah it does not work with MTP versions, since the model is predicting tokens 2-6x depending on settings this leads to more hallucinations, so your k v cache needs to be bf16 to q8.. However, I 100 percent agree with you non MTP variants..
I found the MTP speed was nice, but by the time I was in a good spot, the context was already over with... so I had to admit that I can not run the model in a real world scenario.
chibop1@reddit
Unfortunately none. Even a few months ago, sub 100b models couldn't handle toolcalls reliably. IMHO, Qwen-3.6 is the first sub 100b model that is decent at tool calling.
BitGreen1270@reddit (OP)
Thanks - I'm sort of converging onto that viewpoint. I guess e4b and e2b are okay if you don't mind repeating instructions and keep editing the context for prompts to make the instructions tighter and tighter. But after a point it gets tedious.