How to connect Claude Code CLI to a local llama.cpp server
Posted by StrikeOner@reddit | LocalLLaMA | 44 comments
I’ve seen a lot of people struggling to get Claude Code working with a local llama.cpp setup, so here’s a quick guide that worked for me.
1. CLI (Terminal)
Add this to your .bashrc (or .zshrc):
export ANTHROPIC_AUTH_TOKEN="not_set"
export ANTHROPIC_API_KEY="not_set_either!"
export ANTHROPIC_BASE_URL="http://<your-llama.cpp-server>:8080"
Reload your shell:
source ~/.bashrc
Run the CLI with the model argument:
claude --model Qwen3.5-35B-Thinking
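Before launching, it can help to confirm the server is actually reachable. As a sketch (host and port are placeholders for your setup), llama-server exposes a `/health` endpoint and an OpenAI-compatible model list:

```shell
# Sanity-check the llama.cpp server before pointing Claude Code at it.
# Replace localhost:8080 with your server's host/port.
curl -s http://localhost:8080/health      # returns {"status":"ok"} once the model is loaded
curl -s http://localhost:8080/v1/models   # lists the model name(s) you can pass via --model
```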
2. VS Code setup with the Claude Code extension installed
Edit:
$HOME/.config/Code/User/settings.json
Add:
"claudeCode.environmentVariables": [
{
"name": "ANTHROPIC_BASE_URL",
"value": "http://<your-llama.cpp-server>:8080"
},
{
"name": "ANTHROPIC_AUTH_TOKEN",
"value": "dummy"
},
{
"name": "ANTHROPIC_API_KEY",
"value": "sk-no-key-required"
},
{
"name": "ANTHROPIC_MODEL",
"value": "gpt-oss-20b"
},
{
"name": "ANTHROPIC_DEFAULT_SONNET_MODEL",
"value": "Qwen3.5-35B-Thinking-Coding"
},
{
"name": "ANTHROPIC_DEFAULT_OPUS_MODEL",
"value": "Qwen3.5-27B-Thinking-Coding"
},
{
"name": "ANTHROPIC_DEFAULT_HAIKU_MODEL",
"value": "gpt-oss-20b"
},
{
"name": "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC",
"value": "1"
}
],
"claudeCode.disableLoginPrompt": true
Notes
- This setup lets you use llama.cpp’s server (or llama-swap) to dynamically switch models by selecting one of the preconfigured ones in VS Code.
- Make sure the model names you define here exactly match what you configured in your llama-server.ini.
Prestigious_Ebb4380@reddit
Am I doing anything wrong? Why is this happening?
PS C:\Users\user> $env:ANTHROPIC_BASE_URL = "http://127.0.0.1:8080/v1"
PS C:\Users\user> $env:ANTHROPIC_API_KEY = "local-no-key-needed"
PS C:\Users\user> $env:CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC = "1"
PS C:\Users\user> $env:CLAUDE_CODE_ATTRIBUTION_HEADER = "0"
PS C:\Users\user> $env:CLAUDE_CODE_DISABLE_1M_CONTEXT = "1"
PS C:\Users\user> $env:CLAUDE_CODE_MAX_OUTPUT_TOKENS = "65536"
PS C:\Users\user> claude --model gemma-4-26b
There are some errors on my server too, see below:
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv update_slots: all slots are idle
srv log_server_r: done request: HEAD /v1 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: HEAD /v1 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
srv log_server_r: done request: POST /v1/v1/messages 127.0.0.1 404
Prestigious_Ebb4380@reddit
ok, I solved it: I was using /v1 twice in the URL
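For anyone hitting the same 404s: Claude Code appends `/v1/messages` to ANTHROPIC_BASE_URL itself, so putting `/v1` in the base URL produces the doubled `/v1/v1/messages` path seen in the server log. A quick sketch of the difference:

```shell
# Claude Code builds the request path as "<ANTHROPIC_BASE_URL>/v1/messages".
BASE_GOOD="http://127.0.0.1:8080"
BASE_BAD="http://127.0.0.1:8080/v1"
echo "${BASE_GOOD}/v1/messages"   # -> http://127.0.0.1:8080/v1/messages
echo "${BASE_BAD}/v1/messages"    # -> http://127.0.0.1:8080/v1/v1/messages (the 404 in the log)
```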
twanz18@reddit
Once you get it connected, one thing worth trying is running the agent remotely from your phone. I set up OpenACP to bridge Claude Code to Telegram so I can trigger tasks and see streaming output while away from my desk. Works with llama.cpp backed agents too since it supports any CLI agent. The setup takes about 5 minutes if you already have Node installed. Full disclosure: I contribute to the project.
FeiX7@reddit
Thank you, you inspired me to write this post, and helped a lot.
https://www.reddit.com/r/LocalLLaMA/comments/1scrnzm/local_claude_code_with_qwen35_27b/
FeiX7@reddit
I have tested your approach, but I can't insert images in CC with qwen3.5 27B
```
[Image #1] Analyze the image.
⎿ [Image #1]
⎿ API Error: 500 {"error":{"code":500,"message":"image input is not supported - hint: if this is
unexpected, you may need to provide the mmproj","type":"server_error"}}
```
Jaded_Towel3351@reddit
did you launch the qwen3.5 27B with the mmproj file? Without it, the model doesn't have vision capabilities.
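For reference, the projector has to be loaded at launch time for image input to work. A sketch of the invocation (the model and mmproj filenames are placeholders; use the files shipped alongside your GGUF):

```shell
# Load the multimodal projector alongside the main weights so image input works.
llama-server \
  -m Qwen3.5-27B-Q5_K_XL.gguf \
  --mmproj mmproj-Qwen3.5-27B-f16.gguf \
  --host 0.0.0.0 --port 8080
```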
FeiX7@reddit
yeah, after my comment I realised I needed the mmproj file...
Robos_Basilisk@reddit
How does this work with local models that have different context lengths than Claude's models? Does it adjust?
I'm going to try this out later today, thanks!
StrikeOner@reddit (OP)
mhh, good question. I don't think it does. The few times I tried this CLI with my local models, it was a pure failure on complex tasks, but now that you say that, this might well have been the issue. It's probably a good idea to use one of the models with less context. Let me update the post.
m_mukhtar@reddit
you can control the context and tell Claude Code about your limit by setting two environment variables in your `~/.claude/settings.json`.

The first one is CLAUDE_CODE_AUTO_COMPACT_WINDOW, which I set to my actual llama.cpp context limit (for me, I can run Qwen3.5-27b-Q5 with --ctx-size 110000 without KV quantization, so I set this to 110000). The second one is CLAUDE_AUTOCOMPACT_PCT_OVERRIDE, the percentage of the first at which Claude Code performs context compaction, so you never send anything to llama.cpp beyond what you can run. If you want to use the entire 110000 we set up in the previous variable, set this to 100, but to be safe I set it at 95.

Here is my ~/.claude/settings.json:
```
{
  "$schema": "https://json.schemastore.org/claude-code-settings.json",
  "model": "Qwen_Qwen3.5-27b",
  "env": {
    "ANTHROPIC_BASE_URL": "http://192.168.1.150:8001",
    "ANTHROPIC_API_KEY": "none",
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
    "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "110000",
    "CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "95",
    "DISABLE_PROMPT_CACHING": "1",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1",
    "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1",
    "MAX_THINKING_TOKENS": "0",
    "CLAUDE_CODE_DISABLE_1M_CONTEXT": "1",
    "CLAUDE_CODE_DISABLE_FAST_MODE": "1",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "CLAUDE_CODE_DISABLE_AUTO_MEMORY": "1",
    "DISABLE_AUTOUPDATER": "1"
  },
  "attribution": { "commit": "", "pr": "" },
  "promptSuggestionEnabled": false,
  "prefersReducedMotion": true,
  "terminalProgressBarEnabled": false
}
```
If you want to know what the other variables do, here is a quick rundown of everything. Basically, I used the Claude documentation (https://code.claude.com/docs/en/env-vars) to see all possible variables, and if I saw something specific to Claude models, I disabled it, as it would send headers and additional information that could cause problems with llama.cpp or confuse the model.

DISABLE_PROMPT_CACHING: "1"
this is a claude specific feature to send prompt caching headers, but llama.cpp does not use that, so it could cause unexpected behavior.
CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1"
removes claude specific beta request headers from API calls; again, this is to prevent unexpected behavior.
CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING: "1"
this is also a claude specific feature where the model dynamically allocates thinking tokens so just disable it.
MAX_THINKING_TOKENS: "0"
extended thinking is a claude specific feature, and setting this to 0 disables it entirely. The Qwen model has its own thinking mechanism (enabled by default in llama.cpp unless disabled via --chat-template-kwargs), but it handles that internally, so claude code's thinking budget system doesn't apply.
CLAUDE_CODE_DISABLE_1M_CONTEXT: "1"
removes the 1M context variants from the model picker. irrelevant for local models and keeps the UI clean.
CLAUDE_CODE_DISABLE_FAST_MODE: "1"
this is also a claude specific feature that uses a faster model for simpler tasks. disable it
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC: "1"
this disables the auto-updater, feedback command, Sentry error reporting, and Statsig telemetry all at once. none of these is useful and i thought they might cause unexpected behaviour.
CLAUDE_CODE_DISABLE_AUTO_MEMORY: "1"
this feature creates and loads memory files by communicating with anthropic's servers. wont work with a local endpoint, so just disable it
DISABLE_AUTOUPDATER: "1"
same as the one above
additional nice things to set
attribution: I set this to empty strings for both commit and pr to disable the "Generated with Claude Code" byline in git commits and PRs.
promptSuggestionEnabled: false, to disable the grayed-out prompt suggestions that appear after responses. these rely on a background Haiku call that won't work here
prefersReducedMotion: true and terminalProgressBarEnabled: false reduce UI overhead. These are very minor but keep things snappy.
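On the llama.cpp side, a launch matching the settings above might look like this (the model path is a placeholder, and the host/port should match the ANTHROPIC_BASE_URL in the settings file):

```shell
# Launch llama-server with the same 110000-token budget configured
# in CLAUDE_CODE_AUTO_COMPACT_WINDOW above. Model path is a placeholder.
llama-server \
  -m Qwen3.5-27b-Q5_K_XL.gguf \
  --ctx-size 110000 \
  --host 0.0.0.0 --port 8001
# optionally add: --chat-template-kwargs '{"enable_thinking": false}'
# to turn off Qwen's own thinking instead of leaving it on
```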
sorry if i have spelling or grammar mistakes english is not my first language
fierlion@reddit
thank you for this. I've now got a great local claude + qwen flow going.
m_mukhtar@reddit
Glad this was helpful, and I agree that qwen with claude code is a great local coding experience. If you don't mind sharing, which qwen model and what quantization are you using?
fierlion@reddit
[Qwen3-Coder-Next-REAP-48B-A3B-MXFP4_MOE-GGUF](https://huggingface.co/noctrex/Qwen3-Coder-Next-REAP-48B-A3B-MXFP4_MOE-GGUF) uses MXFP4 quantization.
m_mukhtar@reddit
Hmm, interesting. I gotta try this one. I have been using the 27b at Q5_K_XL from Bartowski and it has been great. Thanks for sharing.
StrikeOner@reddit (OP)
oh, thats way better. let me update the main article one more time. thanks a lot!
FeiX7@reddit
unsloths guide here https://unsloth.ai/docs/basics/claude-code
Fun_Nebula_9682@reddit
nice guide. the performance issues you hit are probably from context window — claude code sends a massive system prompt (CLAUDE.md files, skills, hooks, tool definitions) that easily eats 20-30k tokens before your first message. local models with 32k context are basically running at capacity the whole time.
the other killer is prompt caching. claude code is heavily optimized around anthropic's cached-prefix system, where the static system prompt stays cached across turns. with local llama.cpp that optimization layer doesn't exist, so every turn reprocesses everything from scratch. it works, but you'll feel the latency hard
StrikeOner@reddit (OP)
just updated my post, the cli went from zero to hero with those updated settings. give it a try!
Code_Doctor_83@reddit
Can I set this up on a mid end PC? Ryzen 7600x 16gigs RAM and no GPU :(
StrikeOner@reddit (OP)
oh, I really have no clue. That's a hard cap you've got there. You can try Qwen3.5-9B-GGUF, for example.
LegacyRemaster@reddit
I think we'll see llamacpp + claude code soon
StrikeOner@reddit (OP)
we do sir, we do! With all the great submissions I created a new config and just finished my benchmark run right now. claude performs crazily well for me now! let me prepare the final update for the article. wowawiwa!
EvilBot-666@reddit
Same here — I’d been messing around with Ollama forever. The same model, even with higher quantization (Q6 instead of Q4), runs way faster with llama.cpp. This guide really helped, and I’ve got the Claude Code extension for VS Code running like a charm now. Thanks!
LegacyRemaster@reddit
hero
truthputer@reddit
Settings I use:
Start llama.cpp:
Save to ~/.claude-llama/settings.json :
Start Claude:
I'm keeping my settings separate from the main Claude config so I can switch back and forth. The important part here is CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC and CLAUDE_CODE_ATTRIBUTION_HEADER; without these, my understanding is it can confuse local LLMs with info that causes cache misses.
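As a sketch, the two variables called out above can also be exported inline before launching, alongside the base URL (host/port and model name are placeholders):

```shell
# The two variables that matter most for local backends, per the comment above.
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export CLAUDE_CODE_ATTRIBUTION_HEADER=0
export ANTHROPIC_BASE_URL="http://localhost:8080"
claude --model localmodel
```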
Koalababies@reddit
The final two env variables made a huge difference for me with cache hits for qwen and minimax
redaktid@reddit
Yea removing attribution headers gives a big speedup, otherwise I think it breaks prompt processing
StrikeOner@reddit (OP)
mhh, great one more var to add to the list. let me update the main post.
iamsaitam@reddit
I just set the anthropic base url and it works
jacek2023@reddit
Have you investigated external network traffic (to anthropic, etc) when using local models?
StrikeOner@reddit (OP)
uhm, not using wireshark or such nope. why are you asking?
jacek2023@reddit
I use Claude Code but only with Claude models (for local models I use OpenCode). I wonder is it truly local or maybe Anthropic still uses something on their side.
StrikeOner@reddit (OP)
can't tell. I didn't investigate that deeply; it was enough that it connected to my llama-server instance on my network. To be honest, I don't actually use this CLI that much either, I just thought I might share this here since I've seen a couple of guys struggling with it lately.
jacek2023@reddit
At some point I will try to use it fully offline (with disabled Internet access) and with the sniffer to find out.
SurprisinglyInformed@reddit
I also have these two settings on my file, based on
https://code.claude.com/docs/en/monitoring-usage
and
https://code.claude.com/docs/en/data-usage
{
"name": "CLAUDE_CODE_ENABLE_TELEMETRY",
"value": "0"
},
{
"name": "DISABLE_TELEMETRY",
"value": "1"
},
Spectrum1523@reddit
Now that the code has leaked we can audit it ourselves lol
Lissanro@reddit
A while ago, when I decided to test Claude Code out of curiosity with a local model (Kimi K2.5 running with llama.cpp), it did not work at all - I had all Anthropic domains blocked, and it just kept looping over errors about not being able to connect somewhere instead of doing the task. It seems Claude Code is not intended to be used locally. It also required hacking ~/.claude.json to set hasCompletedOnboarding to true, otherwise it wouldn't even let me try anything (I never had an Anthropic account and tested Claude Code locally only).
jacek2023@reddit
that's why I asked, maybe it depends somehow on the cloud
donmario2004@reddit
If using a vm, like parallels desktop set server to 0.0.0.0, and then you can run llama.cpp in your regular os and have Claude code connect to it inside the vm.
vasimv@reddit
I've found it's much easier to use an alias in llama.cpp (--alias localmodel) and then use that name for claude and other programs using the model, instead of its real name. Easy to type, easy to switch to another model if needed.
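A sketch of the alias flag in practice (the model path is a placeholder; llama-server accepts -a or --alias):

```shell
# Serve the model under a stable name so client configs never need to change.
llama-server -m ./models/Qwen3.5-27B-Q5_K_XL.gguf --alias localmodel --port 8080
# then, regardless of which GGUF is actually loaded:
claude --model localmodel
```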
OrbMan99@reddit
That's a good tip, and most people are going to be running multiple local models at once. If you're switching to a model with a different context size, is Claude going to pick that up automatically, or is a restart needed?
vasimv@reddit
I'm not sure if claude code has that ability. I have to change context size limit in claude code manually.
CulturalMatter2560@reddit
Could actually have something like ampere.sh do it for you... bit of a catch 22 lol