Claude Code can now connect directly to llama.cpp server
Posted by tarruda@reddit | LocalLLaMA | 15 comments
Support for the Anthropic Messages API was merged today, which allows Claude Code to connect to llama-server: https://github.com/ggml-org/llama.cpp/pull/17570
I've been playing with claude code + gpt-oss 120b and it seems to work well at 700 t/s prompt processing and 60 t/s generation. I don't recommend trying slower LLMs because the prompt processing time is going to kill the experience.
PotentialFunny7143@reddit
Nice, which is the smallest LLM that works with it? Does Qwen3-Coder-30B-A3B work?
tarruda@reddit (OP)
I didn't try it, but I suspect it won't work well with smaller LLMs because claude code is very inefficient in its token usage. IIRC just the system prompt takes about 15k tokens, and I haven't seen an LLM in the 30B-parameter range that can deal with this.
No-Low9948@reddit
As you can see from `/context`, the context pressure of Claude Code's built-in tools is high.
The built-in tools alone use up to 15k tokens, while the system prompt itself is only about 3.1k, so removing the tools we don't need should significantly reduce the burden.
tarruda@reddit (OP)
What is `/context` and how do I see it?
Yeah, I imagine most tools exposed by claude code are not needed. It would be cool if llama-server had a customizable way to preprocess request parameters before passing them to the chat template. That way we could delete the tools that are not needed, or even simplify the system prompt.
dtdisapointingresult@reddit
Should be trivial with a Python API proxy in the middle. Point CC at the proxy, write some Python to trim the request, then forward it to llama.cpp.
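The proxy described above could be sketched roughly like this, using only the Python standard library. The upstream address, proxy port, and the `KEEP` set of tool names are all assumptions for illustration, not anything from claude code or llama.cpp, and the sketch buffers the whole reply (no SSE streaming):

```python
# Minimal Anthropic-Messages-API proxy that strips unwanted tool
# definitions from incoming requests before forwarding to llama-server.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

UPSTREAM = "http://localhost:8080"        # llama-server (assumed address)
KEEP = {"Bash", "Read", "Edit", "Write"}  # hypothetical subset of tools to keep

def trim_tools(payload):
    """Drop tool definitions whose name is not in KEEP; leave the rest alone."""
    tools = payload.get("tools")
    if isinstance(tools, list):
        payload["tools"] = [t for t in tools if t.get("name") in KEEP]
    return payload

class TrimmingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            body = json.dumps(trim_tools(json.loads(body))).encode()
        except ValueError:
            pass  # not JSON; forward unchanged
        req = Request(UPSTREAM + self.path, data=body,
                      headers={"Content-Type": "application/json"})
        with urlopen(req) as resp:  # buffers the full reply; no streaming
            data = resp.read()
            self.send_response(resp.status)
            self.send_header("Content-Type",
                             resp.headers.get("Content-Type", "application/json"))
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

def main(port=8765):
    # Run the proxy, then point claude code at it instead of llama-server:
    #   ANTHROPIC_BASE_URL=http://localhost:8765 claude
    HTTPServer(("127.0.0.1", port), TrimmingProxy).serve_forever()
```

The same `trim_tools` hook is also the place where you could rewrite the system prompt, as suggested above.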
PotentialFunny7143@reddit
Wow, 15k is a lot. opencode is 11k and already works quite well.
sakastudio@reddit
Realistically, I think it's better to use OpenAI's or Anthropic's models for agentic coding.
I think this feature will only be useful in some very closed projects.
Magnus114@reddit
Are there any advantages to using claude code router?
tarruda@reddit (OP)
Not familiar with claude code router, but the Anthropic API support allows claude code to connect directly to llama.cpp without any proxies or adapters.
noiserr@reddit
If the slower models were at least better, they could perhaps compensate by making better decisions and iterating less. The issue is they aren't better, at least not the models I've tried, up to 235 billion parameters. I can't run anything bigger.
tarruda@reddit (OP)
The only LLM in that range that I've started to try was Minimax M2 IQ4_XS, but it was far too slow for me to proceed. I gave up after it took 30 minutes to answer the first question about a project. I didn't even bother with Qwen3 235B because it is much smaller and probably wasn't even trained for this kind of agentic use anyway.
IMO the main issue is not caused by the LLMs, but by the fact that claude code basic system prompt + tools take about 15k tokens of context. Local models would benefit from a more efficient CLI tool, one that makes better use of the available context.
GPT-OSS 120b can work (20b seems to get confused by the big system prompt + tools), but it seems to work better with codex.
Mythril_Zombie@reddit
Have any info on how to get them to talk?
tarruda@reddit (OP)
Not sure what you mean, but if you just want to know how to connect claude code, all you need is to set the ANTHROPIC_BASE_URL env var to point to your llama-server URL.
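For example (the model path, port, and flags here are illustrative, not from the thread):

```shell
# Start llama-server; -c sets the context size in tokens
# (at least ~64k is suggested for claude code):
#   llama-server -m gpt-oss-120b.gguf -c 65536 --port 8080

# Point claude code at the local server instead of api.anthropic.com:
export ANTHROPIC_BASE_URL=http://localhost:8080
# ...then launch `claude` in your project as usual.
```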
I'd say you need at least 60k context to get anything out of claude code.
No-Statement-0001@reddit
Awesome! I’ll add this to llama-swap this weekend!
ilintar@reddit
Yup, very cool feature. It also works with other clients that expect the Anthropic Messages API instead of the OpenAI one.