How to connect Claude Code CLI to a local llama.cpp server
Posted by StrikeOner@reddit | LocalLLaMA | 44 comments
I’ve seen a lot of people struggling to get Claude Code working with a local llama.cpp setup, so here’s a quick guide that worked for me.
1. CLI (Terminal)
Add this to your .bashrc (or .zshrc):
export ANTHROPIC_AUTH_TOKEN="not_set"
export ANTHROPIC_API_KEY="not_set_either!"
export ANTHROPIC_BASE_URL="http://<your-llama.cpp-server>:8080"
Reload your shell:
source ~/.bashrc
Run the CLI with the model argument:
claude --model Qwen3.5-35B-Thinking
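Before launching, it can help to confirm the server is actually reachable. As a sketch (host and port are placeholders for your setup), llama-server exposes a `/health` endpoint and an OpenAI-compatible model list:

```shell
# Sanity-check the llama.cpp server before pointing Claude Code at it.
# Replace localhost:8080 with your server's host/port.
curl -s http://localhost:8080/health      # returns {"status":"ok"} once the model is loaded
curl -s http://localhost:8080/v1/models   # lists the model name(s) you can pass via --model
```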
2. VS Code setup with the Claude Code extension installed
Edit:
$HOME/.config/Code/User/settings.json
Add:
"claudeCode.environmentVariables": [
{
"name": "ANTHROPIC_BASE_URL",
"value": "http://<your-llama.cpp-server>:8080"
},
{
"name": "ANTHROPIC_AUTH_TOKEN",
"value": "dummy"
},
{
"name": "ANTHROPIC_API_KEY",
"value": "sk-no-key-required"
},
{
"name": "ANTHROPIC_MODEL",
"value": "gpt-oss-20b"
},
{
"name": "ANTHROPIC_DEFAULT_SONNET_MODEL",
"value": "Qwen3.5-35B-Thinking-Coding"
},
{
"name": "ANTHROPIC_DEFAULT_OPUS_MODEL",
"value": "Qwen3.5-27B-Thinking-Coding"
},
{
"name": "ANTHROPIC_DEFAULT_HAIKU_MODEL",
"value": "gpt-oss-20b"
},
{
"name": "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC",
"value": "1"
}
],
"claudeCode.disableLoginPrompt": true
Notes
- This setup lets you use llama.cpp’s server (or llama-swap) to dynamically switch models by selecting one of the preconfigured ones in VS Code.
- Make sure the model names you define here exactly match what you configured in your llama-server.ini.
Prestigious_Ebb4380@reddit
Am I doing anything wrong? Why is this happening?
PS C:\Users\user> $env:ANTHROPIC_BASE_URL = "http://127.0.0.1:8080/v1"
PS C:\Users\user> $env:ANTHROPIC_API_KEY = "local-no-key-needed"
PS C:\Users\user> $env:CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC = "1"
PS C:\Users\user> $env:CLAUDE_CODE_ATTRIBUTION_HEADER = "0"
PS C:\Users\user> $env:CLAUDE_CODE_DISABLE_1M_CONTEXT = "1"
PS C:\Users\user> $env:CLAUDE_CODE_MAX_OUTPUT_TOKENS = "65536"
PS C:\Users\user> claude --model gemma-4-26b
There are some errors on my server too, see below:
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv update_slots: all slots are idle
srv log_server_r: done request: HEAD /v1 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: HEAD /v1 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
Prestigious_Ebb4380@reddit
ok, I solved it: I was using /v1 twice in the URL
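For anyone hitting the same 404s: Claude Code appends `/v1/messages` to ANTHROPIC_BASE_URL itself, so putting `/v1` in the base URL produces the doubled `/v1/v1/messages` path seen in the server log. A quick sketch of the difference:

```shell
# Claude Code builds the request path as "<ANTHROPIC_BASE_URL>/v1/messages".
BASE_GOOD="http://127.0.0.1:8080"
BASE_BAD="http://127.0.0.1:8080/v1"
echo "${BASE_GOOD}/v1/messages"   # -> http://127.0.0.1:8080/v1/messages
echo "${BASE_BAD}/v1/messages"    # -> http://127.0.0.1:8080/v1/v1/messages (the 404 in the log)
```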
twanz18@reddit
Once you get it connected, one thing worth trying is running the agent remotely from your phone. I set up OpenACP to bridge Claude Code to Telegram so I can trigger tasks and see streaming output while away from my desk. Works with llama.cpp backed agents too since it supports any CLI agent. The setup takes about 5 minutes if you already have Node installed. Full disclosure: I contribute to the project.
FeiX7@reddit
Thank you, you inspired me to write this post, and helped a lot.
https://www.reddit.com/r/LocalLLaMA/comments/1scrnzm/local_claude_code_with_qwen35_27b/
FeiX7@reddit
I have tested your approach, but I can't insert images in CC with qwen3.5 27B
```
[Image #1] Analyze the image.
⎿ [Image #1]
⎿ API Error: 500 {"error":{"code":500,"message":"image input is not supported - hint: if this is
unexpected, you may need to provide the mmproj","type":"server_error"}}
```
Jaded_Towel3351@reddit
did you launch the qwen3.5 27B with the mmproj file? Without it, the model doesn't have vision capabilities.
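For reference, the projector has to be loaded at launch time for image input to work. A sketch of the invocation (the model and mmproj filenames are placeholders; use the files shipped alongside your GGUF):

```shell
# Load the multimodal projector alongside the main weights so image input works.
llama-server \
  -m Qwen3.5-27B-Q5_K_XL.gguf \
  --mmproj mmproj-Qwen3.5-27B-f16.gguf \
  --host 0.0.0.0 --port 8080
```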
FeiX7@reddit
yeah, after my comment I realised I needed the mmproj file...
Robos_Basilisk@reddit
How does this work with local models that have different context lengths than Claude's models? Does it adjust?
I'm going to try this out later today, thanks!
StrikeOner@reddit (OP)
mhh, good question. I don't think it does. The few times I tried this CLI with my local models, it was a pure failure on complex tasks, but now that you say that, this might well have been the issue. It's probably a good idea to use one of the models with less context. Let me update the post.
m_mukhtar@reddit
you can control the context and tell Claude Code about your limit by setting two environment variables in your `~/.claude/settings.json`.

The first one is CLAUDE_CODE_AUTO_COMPACT_WINDOW, which I set to my actual llama.cpp context limit (for me, I can run Qwen3.5-27b-Q5 with --ctx-size 110000 without KV quantization, so I set this to 110000). The second one is CLAUDE_AUTOCOMPACT_PCT_OVERRIDE, the percentage of the first at which Claude Code performs context compaction, so you never send anything to llama.cpp beyond what you can run. If you want to use the entire 110000 we set up in the previous variable, set this to 100, but to be safe I set it at 95.

Here is my ~/.claude/settings.json:
```
{
  "$schema": "https://json.schemastore.org/claude-code-settings.json",
  "model": "Qwen_Qwen3.5-27b",
  "env": {
    "ANTHROPIC_BASE_URL": "http://192.168.1.150:8001",
    "ANTHROPIC_API_KEY": "none",
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
    "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "110000",
    "CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "95",
    "DISABLE_PROMPT_CACHING": "1",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1",
    "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1",
    "MAX_THINKING_TOKENS": "0",
    "CLAUDE_CODE_DISABLE_1M_CONTEXT": "1",
    "CLAUDE_CODE_DISABLE_FAST_MODE": "1",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "CLAUDE_CODE_DISABLE_AUTO_MEMORY": "1",
    "DISABLE_AUTOUPDATER": "1"
  },
  "attribution": { "commit": "", "pr": "" },
  "promptSuggestionEnabled": false,
  "prefersReducedMotion": true,
  "terminalProgressBarEnabled": false
}
```
If you want to know what the other variables do, here is a quick rundown of everything. Basically, I used the Claude documentation (https://code.claude.com/docs/en/env-vars) to see all possible variables, and if I saw something specific to Claude models, I disabled it, as it would send headers and additional information that could cause problems with llama.cpp or confuse the model.

DISABLE_PROMPT_CACHING: "1"
this is a claude specific feature to send prompt caching headers, but llama.cpp does not use that, so it could cause unexpected behavior.
CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1"
removes claude specific beta request headers from API calls; again, this is to prevent unexpected behavior.
CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING: "1"
this is also a claude specific feature where the model dynamically allocates thinking tokens so just disable it.
MAX_THINKING_TOKENS: "0"
extended thinking is a claude specific feature, and setting this to 0 disables it entirely. The Qwen model has its own thinking mechanism (enabled by default in llama.cpp unless disabled via --chat-template-kwargs), but it handles that internally, so claude code's thinking budget system doesn't apply.
CLAUDE_CODE_DISABLE_1M_CONTEXT: "1"
removes the 1M context variants from the model picker. irrelevant for local models and keeps the UI clean.
CLAUDE_CODE_DISABLE_FAST_MODE: "1"
this is also a claude specific feature that uses a faster model for simpler tasks. disable it
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC: "1"
this disables the auto-updater, feedback command, Sentry error reporting, and Statsig telemetry all at once. none of these is useful and i thought they might cause unexpected behaviour.
CLAUDE_CODE_DISABLE_AUTO_MEMORY: "1"
this feature creates and loads memory files by communicating with anthropic's servers. wont work with a local endpoint, so just disable it
DISABLE_AUTOUPDATER: "1"
same as the one above
additional nice things to set
attribution: I set this to empty strings for both commit and pr to disable the "Generated with Claude Code" byline in git commits and PRs.
promptSuggestionEnabled: false, to disable the grayed-out prompt suggestions that appear after responses. these rely on a background Haiku call that won't work here
prefersReducedMotion: true and terminalProgressBarEnabled: false reduce UI overhead. These are very minor but keep things snappy.
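On the llama.cpp side, a launch matching the settings above might look like this (the model path is a placeholder, and the host/port should match the ANTHROPIC_BASE_URL in the settings file):

```shell
# Launch llama-server with the same 110000-token budget configured
# in CLAUDE_CODE_AUTO_COMPACT_WINDOW above. Model path is a placeholder.
llama-server \
  -m Qwen3.5-27b-Q5_K_XL.gguf \
  --ctx-size 110000 \
  --host 0.0.0.0 --port 8001
# optionally add: --chat-template-kwargs '{"enable_thinking": false}'
# to turn off Qwen's own thinking instead of leaving it on
```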
sorry if i have spelling or grammar mistakes english is not my first language
fierlion@reddit
thank you for this. I've now got a great local claude + qwen flow going.
m_mukhtar@reddit
Glad this was helpful, and I agree that qwen with claude code is a great local coding experience. If you don't mind sharing, which qwen model and what quantization are you using?
fierlion@reddit
[Qwen3-Coder-Next-REAP-48B-A3B-MXFP4_MOE-GGUF](https://huggingface.co/noctrex/Qwen3-Coder-Next-REAP-48B-A3B-MXFP4_MOE-GGUF) uses MXFP4 quantization.
m_mukhtar@reddit
Hmm, interesting. I gotta try this one. I have been using the 27b at Q5_K_XL from Bartowski and it has been great. Thanks for sharing.
StrikeOner@reddit (OP)
oh, thats way better. let me update the main article one more time. thanks a lot!
FeiX7@reddit
unsloths guide here https://unsloth.ai/docs/basics/claude-code
Fun_Nebula_9682@reddit
nice guide. the performance issues you hit are probably from context window — claude code sends a massive system prompt (CLAUDE.md files, skills, hooks, tool definitions) that easily eats 20-30k tokens before your first message. local models with 32k context are basically running at capacity the whole time.
the other killer is prompt caching. claude code is heavily optimized around anthropic's cached-prefix system, where the static system prompt stays cached across turns. with local llama.cpp that optimization layer doesn't exist, so every turn reprocesses everything from scratch. it works, but you'll feel the latency hard
StrikeOner@reddit (OP)
just updated my post, the cli went from zero to hero with those updated settings. give it a try!
Code_Doctor_83@reddit
Can I set this up on a mid end PC? Ryzen 7600x 16gigs RAM and no GPU :(
StrikeOner@reddit (OP)
oh, I really have no clue. That's a hard cap you've got there. You can try Qwen3.5-9B-GGUF, for example.
LegacyRemaster@reddit
I think we'll see llamacpp + claude code soon
StrikeOner@reddit (OP)
we do sir, we do! With all the great submissions I created a new config and just finished my benchmark run right now. claude performs crazily well for me now! let me prepare the final update for the article. wowawiwa!
EvilBot-666@reddit
Same here — I’d been messing around with Ollama forever. The same model, even with higher quantization (Q6 instead of Q4), runs way faster with llama.cpp. This guide really helped, and I’ve got the Claude Code extension for VS Code running like a charm now. Thanks!
LegacyRemaster@reddit
hero
truthputer@reddit
Settings I use:
Start llama.cpp:
Save to ~/.claude-llama/settings.json :
Start Claude:
I'm keeping my settings separate from the main Claude config so I can switch back and forth. The important part here is CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC and CLAUDE_CODE_ATTRIBUTION_HEADER; without these, my understanding is it can confuse local LLMs with info that causes cache misses.
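As a sketch, the two variables called out above can also be exported inline before launching, alongside the base URL (host/port and model name are placeholders):

```shell
# The two variables that matter most for local backends, per the comment above.
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export CLAUDE_CODE_ATTRIBUTION_HEADER=0
export ANTHROPIC_BASE_URL="http://localhost:8080"
claude --model localmodel
```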
Koalababies@reddit
The final two env variables made a huge difference for me with cache hits for qwen and minimax
redaktid@reddit
Yea removing attribution headers gives a big speedup, otherwise I think it breaks prompt processing
StrikeOner@reddit (OP)
mhh, great one more var to add to the list. let me update the main post.
iamsaitam@reddit
I just set the anthropic base url and it works
jacek2023@reddit
Have you investigated external network traffic (to anthropic, etc) when using local models?
StrikeOner@reddit (OP)
uhm, not using wireshark or such nope. why are you asking?
jacek2023@reddit
I use Claude Code but only with Claude models (for local models I use OpenCode). I wonder is it truly local or maybe Anthropic still uses something on their side.
StrikeOner@reddit (OP)
can't tell. I didn't investigate that deeply; it was enough that it connected to my llama-server instance on my network. To be honest, I don't actually use this CLI that much either, I just thought I might share this here since I've seen a couple of guys struggling with it lately.
jacek2023@reddit
At some point I will try to use it fully offline (with disabled Internet access) and with the sniffer to find out.
SurprisinglyInformed@reddit
I also have these two settings on my file, based on
https://code.claude.com/docs/en/monitoring-usage
and
https://code.claude.com/docs/en/data-usage
{
"name": "CLAUDE_CODE_ENABLE_TELEMETRY",
"value": "0"
},
{
"name": "DISABLE_TELEMETRY",
"value": "1"
},
Spectrum1523@reddit
Now that the code has leaked we can audit it ourselves lol
Lissanro@reddit
A while ago, when I decided to test Claude Code out of curiosity with a local model (Kimi K2.5 running with llama.cpp), it did not work at all - I had all Anthropic domains blocked, and it just kept looping over errors about not being able to connect somewhere instead of doing the task. It seems Claude Code is not intended to be used locally. It also required hacking ~/.claude.json to set hasCompletedOnboarding to true, otherwise it wouldn't even let me try anything (I never had an Anthropic account and tested Claude Code locally only).
jacek2023@reddit
that's why I asked, maybe it depends somehow on the cloud
donmario2004@reddit
If using a vm, like parallels desktop set server to 0.0.0.0, and then you can run llama.cpp in your regular os and have Claude code connect to it inside the vm.
vasimv@reddit
I've found it's much easier to use an alias in llama.cpp (--alias localmodel) and then use that name for claude and other programs using the model, instead of its real name. Easy to type, easy to switch to another model if needed.
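A sketch of the alias flag in practice (the model path is a placeholder; llama-server accepts -a or --alias):

```shell
# Serve the model under a stable name so client configs never need to change.
llama-server -m ./models/Qwen3.5-27B-Q5_K_XL.gguf --alias localmodel --port 8080
# then, regardless of which GGUF is actually loaded:
claude --model localmodel
```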
OrbMan99@reddit
That's a good tip, and most people are going to be running multiple local models at once. If you're switching to a model with a different context size, is Claude going to pick that up automatically, or is a restart needed?
vasimv@reddit
I'm not sure if claude code has that ability. I have to change context size limit in claude code manually.
CulturalMatter2560@reddit
Could actually have something like ampere.sh do it for you... bit of a catch 22 lol