Are agents even useful with all-local models?
Posted by bsawler@reddit | LocalLLaMA | 21 comments
I've been trying to step up my usage and try out all the new toys over the past few weeks. It feels like I've been jumping from thing to thing to thing.
Claude Code (with local LLM), OpenClaw, Hermes, Pi, Paperclip, etc.
Do ANY of them actually "just work" with local LLMs? With the exception of Pi, which is super-restrictive by default, the rest are just failure after failure after failure on every task I give them that isn't "write a markdown document" or "write a bit of code in language X".
Claude was able to (extremely slowly, like 1/10th the speed of Pi) generate some Python that was passable. But anything beyond simple document reading/writing/editing would fail because it expected Anthropic's various services.
OpenClaw failed non-stop at any task I gave it beyond simple chatting (and if I'm just going to chat, I don't need an agentic harness!), unless I installed a bunch of security-risk-ridden software that's going to do god-knows-what on my network.
Hermes would (sometimes) show up in Discord/Slack. But half of its functionality would fail: sure, it could generate a document, and I even got it to talk to my local ComfyUI to generate a (truly horrible-looking) image, but it couldn't actually pin it to Slack or Discord, which means I had no way of getting anything from it short of breaking into the Docker container's storage and doing a manual exfil operation...
And then lastly Paperclip: yay, my CEO hired a CTO and a CMO... and they both immediately failed their tasks, and every issue I filed against any of my AI "employees" would end up spinning and failing to complete anything.
All of this is across a number of models on my Strix Halo system (so 128GB, 112GB usable as VRAM): Qwen 3.5, 3.6, Qwen 3 Coder Next, Llama 3.3 70b, GPT-OSS 120b, GLM 4.7 Flash, Gemma 4 31b and e4b. (Edit: this is all on llama-server (llama.cpp), updated roughly weekly, running everything with a 128k context size.)
I'm 100% willing to believe I'm just dumb and missing something... but after weeks of trying different tools and running into the same issues over and over, is this just where we're at for local AI? We can locally host all the agents, but that means nothing if you still have to sign up for countless subscriptions and pass all your data to outside services; avoiding exactly that is the entire reason I (and many of you, I suspect) am spending all this money on local AI hardware to begin with.
cakemates@reddit
I have done tons of cool shit with just these models; I have an army of bots and agents that I built helping me with my hobbies and work to some degree. But you have to understand this hobby is on the bleeding edge of technology, and you need to adjust your expectations to reality. Expecting things to "just work" is asking too much at this point; we'll get there some day.
bsawler@reddit (OP)
Yeah, the more I read the comments here, the more I'm leaning toward "YouTubers are all just full of crap"... which, honestly, I'm old enough that I should know by now. But every now and then I give in to the hype and it punishes me :)
colin_colout@reddit
Which YouTuber? If you don't see them use it for more than a few turns (and on something more than a toy example), then expect that to be all you get if you follow in their footsteps.
If they had a "Sonnet at home", they'd show the whole workflow.
abnormal_human@reddit
You can absolutely do useful work with them with the right harness. If you're just plug-and-praying into an OpenClaw-type system, you'll have a lot less luck.
bsawler@reddit (OP)
You say useful work, but can you give me an example of what type of fully-local workload you've been able to do that isn't just "write some code" or "write a document"? Or is that still where we're at, despite what Reddit and YouTube and the rest of the internet would have me believe?
abnormal_human@reddit
I am building an agent that does visual asset/graphics work in multiple domains. Multistep processes, iteration, vision, complex tasks, long context, etc. My main targets are Qwen 3.5 122B and 397B. Before Qwen 3.5 I was using gpt-oss-120b with qwen3-vl-30b-a3b as a separate vision model. I try to target open source to keep costs under control.
It's an open-world agent: free access to a filesystem, a sandbox, a Python interpreter, plus an application SDK. It needs to make choices and tool calls to accomplish tasks free-form.
What makes this work well? A simple, disciplined approach to system prompts. Progressive disclosure. A robust skills system. Highly general tools that mirror typical coding-agent tools, since that's what model producers are RL'ing around. And of course a comprehensive eval suite that proves out the performance. If you're developing an agent harness and you eval around a model, you'll either make it work or figure out why you can't. If you're not running evals, you may as well be guessing.
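To make the eval point concrete, here's a minimal sketch of the kind of loop I mean. Nothing here is a real library; run_agent and the cases are placeholders for your own harness and task suite.

```python
# Minimal eval-loop sketch. run_agent() and the cases are placeholders
# for your own harness and tasks, not a real library.

# Each case: a task prompt plus a predicate over the agent's final output.
EVAL_CASES = [
    {"task": "Resize logo.png to 256x256 and save it as logo_256.png",
     "check": lambda out: "logo_256.png" in out.get("files_written", [])},
    {"task": "Summarize README.md into summary.md",
     "check": lambda out: "summary.md" in out.get("files_written", [])},
]

def run_agent(task: str, model: str) -> dict:
    """Placeholder: invoke your harness and return a structured result."""
    raise NotImplementedError

def run_evals(model: str) -> float:
    passed = 0
    for case in EVAL_CASES:
        try:
            result = run_agent(case["task"], model)
            passed += bool(case["check"](result))
        except Exception as e:
            print(f"FAIL ({model}): {case['task']} -> {e}")
    score = passed / len(EVAL_CASES)
    print(f"{model}: {passed}/{len(EVAL_CASES)} passed ({score:.0%})")
    return score

# Score candidate models on the identical suite before committing to one.
for m in ("gpt-oss-120b", "qwen3.5-122b"):
    run_evals(m)
```

The point isn't these particular checks; it's that every harness or prompt change gets scored against the same suite instead of vibes.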
bsawler@reddit (OP)
How much of all that worked out of the box with some proper prompting, and how much was extensions/addons/skills/tools that you had to develop and implement yourself to enable it?
I'm wondering if my expectations are just wrong, and I shouldn't assume these highly-touted agents can actually do much of anything on their own, but rather that they just enable me to implement the various things I want them to do? In which case they're not much better than my old method of calling my local LLM's API and passing in tools that I implement myself?
And if that's the case, that's fine - maybe it's less "this just doesn't work" and more "I need to reframe how I think about all of this", which certainly wouldn't be the first time.
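(For reference, the "old method" I mean is roughly the sketch below: hitting llama-server's OpenAI-compatible endpoint directly. The port, model alias, and read_file tool are placeholders from my own setup.)

```python
# Minimal "bring your own tools" sketch against llama-server's
# OpenAI-compatible endpoint. Port, model alias, and tool are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8033/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a text file and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

messages = [{"role": "user", "content": "Summarize notes.txt in one paragraph."}]
resp = client.chat.completions.create(model="qwencoder", messages=messages, tools=tools)
msg = resp.choices[0].message

# If the model requested the tool, run it locally and feed the result back.
if msg.tool_calls:
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": read_file(**args),
        })
    resp = client.chat.completions.create(model="qwencoder", messages=messages, tools=tools)

print(resp.choices[0].message.content)
```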
abnormal_human@reddit
There is no "out of the box" or "plug'n'play" with LLMs. Even switching between frontier models requires harness optimization and prompt re-engineering to extract max performance. With most local models, the burden to get the harness right is higher because the models are weaker.
Personally, I built a harness from the ground up, but my goal is to make a product that enables other people to do work, not just to get work done, so I had many reasons for that approach.
nikhilprasanth@reddit
With the right harness and parameters, models like Qwen follow instructions nicely. For example, you could have it set up containers for repos by simply pointing it at a GitHub repo and asking it to build the container.
For a personal example, I set up a skill where I share an Instagram reel or YouTube short with movie/series/book recommendations. It processes the metadata first to get the content; if that fails it tries the audio; if that fails it takes frames from the video and OCRs them to understand the recommendations. Then it web-searches to get the ratings and saves them to my Obsidian vault in the respective watchlist.
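The fallback logic of that skill boils down to something like this sketch; every helper here is a hypothetical stand-in for a real tool the harness exposes.

```python
# Sketch of the skill's fallback chain. Every helper below is a
# hypothetical stand-in for a real tool in the harness.
def from_metadata(url: str) -> list[str]:
    raise NotImplementedError  # parse titles out of the post's metadata

def from_audio_transcript(url: str) -> list[str]:
    raise NotImplementedError  # transcribe the audio, extract titles

def from_frame_ocr(url: str) -> list[str]:
    raise NotImplementedError  # grab video frames and OCR them

def web_search_rating(title: str) -> str:
    raise NotImplementedError  # look up ratings via web search

def append_to_obsidian(note: str, line: str) -> None:
    raise NotImplementedError  # append a line to an Obsidian note

def extract_recommendations(url: str) -> list[str]:
    # Cheapest signal first, progressively heavier fallbacks after.
    for extractor in (from_metadata, from_audio_transcript, from_frame_ocr):
        try:
            titles = extractor(url)
            if titles:
                return titles
        except Exception:
            continue  # this source failed; try the next one
    return []

def save_to_watchlist(titles: list[str]) -> None:
    for title in titles:
        rating = web_search_rating(title)
        append_to_obsidian("Watchlist", f"- {title} ({rating})")
```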
JamesEvoAI@reddit
You should stop looking to YouTube hypebeasts for the direction of the industry. That said, I regularly let these things rip on my inference server to do configuration updates. Right now I have Qwen 3.5 downloading and setting up my llama-swap/LiteLLM stack with Qwen 3.6 27B.
sdfgeoff@reddit
I think you'll find that most people consider the ability to speak to a computer via STT, have it edit a document or write some code, and then answer back with TTS to be marvellous (and fairly useful).
Low_Blueberry_6711@reddit
Tool calling reliability is the real bottleneck — most local models weren't trained with the same volume of function-calling data as GPT-4 or Claude. Smaller models especially tend to hallucinate tool signatures or loop on failures. Mistral and Qwen series hold up better than most for actual agentic stuff in my experience, but you'll still hit walls on multi-step tasks that require backtracking.
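A cheap guardrail for the hallucinated-signature problem is validating every tool call against its declared schema before executing it, then handing the error back to the model instead of crashing. A rough sketch using the jsonschema package (the tool registry here is made up):

```python
# Validate a model's tool call against its declared JSON schema before
# executing it, so bad calls become feedback instead of crashes.
import json
from jsonschema import validate, ValidationError

TOOL_SCHEMAS = {  # hypothetical registry: tool name -> parameters schema
    "read_file": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
        "additionalProperties": False,
    },
}

def check_tool_call(name: str, raw_args: str) -> tuple[bool, str]:
    if name not in TOOL_SCHEMAS:
        return False, f"Unknown tool '{name}'. Available: {list(TOOL_SCHEMAS)}"
    try:
        args = json.loads(raw_args)
        validate(args, TOOL_SCHEMAS[name])
    except (json.JSONDecodeError, ValidationError) as e:
        return False, f"Bad arguments for '{name}': {e}"
    return True, "ok"

ok, detail = check_tool_call("read_file", '{"path": "notes.txt"}')  # -> (True, "ok")
```

On failure, the detail string goes back to the model as the tool result, which is usually enough for the stronger local models to self-correct.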
MengerianMango@reddit
I like goose. It's very simple. Depends what you're trying to do, I guess. I'm mostly a Linux programmer/sysadmin type, so I live in the terminal anyway.
Double_Cause4609@reddit
...By the way, what command are you running llama-server with? Are you passing `--jinja`?
Also, have you verified the same results in vLLM (which has a more stable function calling scheme)?
bsawler@reddit (OP)
I haven't tried vLLM, as I just haven't gotten it running stably (I found a version of ROCm that plays nice with ComfyUI, but vLLM hates it and crashes the system... I probably need to find a different ROCm version that vLLM likes?).
My llama-server line looks something like:

```
./llama-server --models-preset models.ini --jinja -fa on -np 2 --no-mmap -ub 4096 -b 4096 --host 0.0.0.0 --port 8033
```
with the entries in models.ini looking like:
```ini
[qwencoder]
model = /home/bcs/llmmodels/Qwen_Qwen3-Coder-Next-Q4_K_M.gguf
temp = 0.7
top-p = 0.95
top-k = 40
ctx-size = 131072
n-gpu-layers = 99
jinja = on
```
sdfgeoff@reddit
Last week I fiddled with Hermes quite successfully with Qwen 3.5 27B using Unsloth's Q4 quant. Make sure you update llama.cpp. If you're using Ollama, that's almost certainly the problem.
bsawler@reddit (OP)
Ah, important info I left out - I just updated the body to include that yes, this is on llama-server (llama.cpp), updated regularly (roughly weekly), and running everything with a 128k context size.
sdfgeoff@reddit
Hmm, it should be fine then. What is actually going wrong? I just vibe-coded a webapp (React, TypeScript, Rust backend) with Qwen3.6-37B-Q_0 Unsloth quants, just by pointing claude code at it. Seemed to work fine to me? It took 8 hours, and maybe Claude would have done a better job, but it got it done.
Don't overthink all the agentic stuff. Simple is often best. I often think large/complex projects like Paperclip are cool in theory, but they're 50% roleplaying, 40% wasted tokens, and maybe 10% better than just using a single coding agent.
Agents like OpenClaw/Hermes are mostly just coding agents with a scheduler and Telegram integration. Useful, but more complexity for marginal utility over a simple coding agent, at least for a technical user.
bsawler@reddit (OP)
Yeah, I'm wondering how much of what I'm failing at is just me expecting things to be as useful as all the YouTube hype videos make them seem. But simple things like "generate a logo for XYZ and post it to the #art channel" led to a crap-tastic image being generated (fine, I'll take the blame for image quality and prompting) and then being spammed with 45 "Here's your image" posts in the #art channel on Slack... with no image attached. It's that type of stuff that feels like it's supposed to just work and just isn't for me, and I'm wondering if it's the model (I feel like I've tried enough there...) or just not something that's expected to work for local models right now.
sdfgeoff@reddit
I found Qwen3.5 often struggled with tool calling in Hermes: it would go "I'm patching file X", the tool would error, and it would go "silly me, I forgot the file name, let me try again" and then endlessly loop doing that. I think that's because the Hermes agent is different from whatever harness the model was expecting, or maybe its tool descriptions aren't quite clear. I haven't tried Hermes with 3.6 yet. However, both worked fine with claude-code as a harness - maybe claude-code has better handling around failed tool calls somehow? I haven't dug into it.
Images also occupy a lot of context if it decides to look at them itself, which probably doesn't help its intelligence.
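If I had to guess, claude-code breaks that loop by capping retries and feeding the raw error back to the model. A rough sketch of that pattern; execute_tool and ask_model here are stand-ins, not real APIs:

```python
# Loop-breaking around failed tool calls: cap retries and return the
# accumulated errors. execute_tool/ask_model are hypothetical stand-ins
# for whatever your harness actually uses.
MAX_RETRIES = 3

def run_tool_with_retries(call, ask_model, execute_tool):
    errors = []
    for attempt in range(MAX_RETRIES):
        try:
            return execute_tool(call)
        except Exception as e:
            errors.append(str(e))
            # Hand the model the exact error so it can correct the call,
            # instead of letting it guess and loop forever.
            call = ask_model(
                f"Tool call failed (attempt {attempt + 1}/{MAX_RETRIES}): {e}. "
                "Fix the arguments and try again."
            )
    raise RuntimeError("Giving up after repeated tool failures: " + "; ".join(errors))
```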
ea_man@reddit
Models, prompts, and harnesses are not all the same, and they don't necessarily mix and match.
If you want something plug-n-play, try Qwen 3.6 27B with Qwencode.
For editing files, Aider usually works too.