jacek2023@reddit (OP)
it still works correctly
SnooPaintings8639@reddit
I watched a YouTube video today where the author of pi.dev shows that opencode tries to "optimize the prompt (context?)" often. That saves tokens, but it breaks the cache.
I don't know if this is true, but it would match your experience.
jacek2023@reddit (OP)
I debugged it in llama.cpp. It's not "prompt optimization", it's just fucked up: moving the system-reminder around completely breaks the cache in llama.cpp.
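To see why moving a block breaks things, here's a toy sketch (not llama.cpp's actual code): a prefix-caching backend can only reuse KV entries for the longest common token prefix between the cached prompt and the new one, so relocating a system-reminder near the top invalidates nearly everything after it.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Length of the longest common prefix: the only part of the
    KV cache a prefix-caching backend can reuse without recompute."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Toy prompts as token lists: same content, but the "reminder"
# block has moved, so the prefixes diverge almost immediately.
cached = ["sys", "tools", "reminder", "turn1", "turn2"]
moved  = ["sys", "reminder", "tools", "turn1", "turn2"]

print(reusable_prefix_len(cached, cached + ["turn3"]))  # appending only: all 5 cached tokens reused
print(reusable_prefix_len(cached, moved))               # moved block: only 1 token reused
```

Appending a new turn keeps the whole cache warm; reordering anything early in the prompt forces reprocessing from the divergence point.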
SnooPaintings8639@reddit
lol
sine120@reddit
If you don't have a supercomputer, use Pi (https://shittycodingagent.ai/). The system prompt is a lot smaller, saving you context and prompt-processing (PP) time. I haven't used it with gemma much yet, but with Qwen3.6 it's been great.
Haiku-575@reddit
I've been testing with Qwen3.6 as well. It is, in fact, great. https://pi.dev/ is the same thing if you want a less shitty URL.
Ill-Fishing-1451@reddit
Opencode sometimes prunes the context, which invalidates the cache and forces the whole prompt to be reprocessed. This is annoying with the llama.cpp backend.
jacek2023@reddit (OP)
check my comment below about it
Weird_Search_4723@reddit
You can give https://github.com/0xku/kon a shot as well (i'm the author)
It's extremely lightweight (the actual code as well as the system prompt) and works very well with local models.
I've posted about it recently as well https://www.reddit.com/r/LocalLLaMA/comments/1shkqj5/gemma426ba4b_with_my_coding_agent_kon/
silenceimpaired@reddit
I've never tried to use models for coding. I'll have to check these two out. I'd love to know what others are using.
jacek2023@reddit (OP)
what do you use them for?
silenceimpaired@reddit
I should add I have used them for coding but only in a copy/paste type way.
jacek2023@reddit (OP)
that's a different kind of coding; agentic coding is much more complex/demanding
silenceimpaired@reddit
Yeah, I know. Hence why I am looking for insight before tackling it.
silenceimpaired@reddit
Writing: fixing grammar and spelling, and brainstorming. But I have an idea for a writing tool I want to make, so I need to explore coding-assist solutions.
notlesh@reddit
I tested gemma-4-24B-A4B quite a bit. Overall it did well, but it would fairly frequently get into output loops where it repeated the exact same thing until it ran out of context. I finally gave up. This was using opencode and ollama.
jacek2023@reddit (OP)
I don't see loops anymore with that command
notlesh@reddit
Which part in particular? The `--repeat-penalty 1.15`? I played with this a bit but it seemed to cause problems with tool calls where the structured output necessarily requires repetition.
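For reference, this is the kind of llama.cpp invocation being discussed; the model path and context size below are placeholders, so check `llama-server --help` on your build for the exact flags.

```shell
# Illustrative launch; -m path and -c value are placeholders.
llama-server \
  -m ./gemma-4-24B-A4B.gguf \
  -c 32768 \
  --repeat-penalty 1.15
# The penalty damps output loops, but it also penalizes the repeated
# structure (braces, quoted keys) that JSON tool calls legitimately need,
# which is the tool-call breakage described above.
```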
jacek2023@reddit (OP)
no idea, I was experimenting
Clean_Initial_9618@reddit
Is opencode good with local models? My hardware is an RTX 3090 and 64GB of system RAM. I'm currently running qwen3.6 35B IQ4_NL with 131k context. Would that be good for local coding with opencode?
a2islife@reddit
Same hardware. Same model, but using the unsloth Q4_K_S/M quants (I switch depending on the context or task required). I have been enjoying coding with OpenCode; at least it works great for my use cases.
Getting 110+ tok/sec at about 100k context, I can iterate pretty quickly locally. It can handle slightly complex coding too, but it requires more attention and better prompting on my end.
For highly complex coding tasks, I generate detailed plans with Codex GPT-5.4 or 5.3, then make Qwen3.6 do the grunt work. Finally, review/refactor code using Codex/GPT models, then share back findings/fix plans with Qwen3.6 to get going 💪
jacek2023@reddit (OP)
I tested q8_0 cache but it was slower? Is it faster on your side?
a2islife@reddit
No, it's not faster with q8_0, but it seemingly gives more room for context. Even so, I try not to use it and manage with a relatively smaller context instead.
I've noticed prompt processing time increases drastically, and token generation drops from about 110 tok/s to 100 tok/s.
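For context, the q8_0 cache being discussed is llama.cpp's quantized KV cache, enabled with the cache-type flags; the model path below is a placeholder, and flag availability varies by build, so verify with `llama-server --help`.

```shell
# Quantize the KV cache to q8_0: roughly halves KV memory vs f16,
# freeing room for longer context at some speed cost (the trade-off
# described above). Model path and context size are placeholders.
llama-server \
  -m ./qwen3.6-35b-iq4_nl.gguf \
  -c 131072 \
  -fa on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
# Note: on many llama.cpp builds, quantizing the V cache requires
# flash attention (-fa) to be enabled.
```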
jacek2023@reddit (OP)
Yes that's why I am staying on default
Certain-Cod-1404@reddit
Yes, it's really good, but obviously it's an SLM, and agentic coding isn't perfect even with frontier models.
stopbanni@reddit
The best with local LLMs for me currently is hermes agent; it even works with models as small as Qwen3.5 4B.