[-]

Pristine-Woodpecker@reddit

Still randomly stops in OpenCode without getting working code.

[-]

jacek2023@reddit (OP)

GLM Flash is a good model to me. I don't care about benchmarks/leaderboards at all.

[-]

Me neither but GLM Flash just fails to get working code on a lot of problems that I threw at it, that other models can. So it's not surprising that also shows in tests that measure its ability to write code for problems it hasn't ever seen before.

[-]

jacek2023@reddit (OP)

which model do you use for coding then?

[-]

Pristine-Woodpecker@reddit

Qwen3.5 27B or 122B-A10B. Before that previous Qwen-Coder or latest Devstral. All of those worked much better.

[-]

Far_Cat9782@reddit

Glm 4.7 flash is absolutely a sevicable model

[-]

Pristine-Woodpecker@reddit

It would always use Devstral or any Qwen over it. It never managed to write any usable code when I tested it.

[-]

uber-linny@reddit

I thought it was me .. but I've seen it randomly stopping in chat while it's thinking

[-]

jamorham@reddit

I'm not even seeing the thinking, she is just executing tools on after another and doing stuff without any narrative of why. Kind of terrifying not being able to see the reasoning.

[-]

Randomdotmath@reddit

i agree about glm 4.7 flash, but gemma 4 as same as it right now

[-]

aldegr@reddit

No guarantees, but I recommend you try out the parser PR and the included "interleaved" template. I believe there is a gap in the original chat templates.

[-]

FullstackSensei@reddit

Dear community, this is such a recurring theme that it's practically guaranteed every model release has issues either with the model tokenizer or (much much more commonly) inference code.

And while we should help test to catch these bugs early on, we should also refrain from passing judgment about a model's quality, speed, memory, etc at least for the first few days while these issues get worked out.

It's almost every model release: model is horrible -> bugs fixed -> model is great!

[-]

andrej7@reddit

I have all the same errors. I am quite happy how fast it runs on my older PC, but it fails on tool calling in most harnesses I have tried. (Hermes, OpenCode) So, from that perspective it's kind of useless, until they fix it. I wasn't sure if this is just because these models are so distilled and quantized that it simply fails on tool calls, or is it just an issue on the harness side. I hope it's fixable, this could be total game changer for local AI. I was testing it today on my 8GB GPU with full context and unofficial TuboQuant llama-cpp build, it's fast enough to be usable. I would like to see if it's smart enough to be usable too, but because the tool commands doesn't work I can't do any proper testing. However, on the first sight it's promising. I am using (gemma-4-E4B-it-Q5_K_M unsloth v.)

[-]

LostDrengr@reddit

I have been using the E4B model and it has been great. About an hour ago I tried the 26B-A4B and so far I am getting empty chats beyond the first prompt. It does the compute element and the reasoning seems to be a bug there. I am using the 8661 release but will keep looking to see what is the item if it can be tweaked.

[-]

Plasmx@reddit

Did you try the E4B model with a coding agent or tool use? For me that didn’t work because the agent always wanted an user input after a very short time.

[-]

LostDrengr@reddit

I did not unless you count me being the agent! If I have time I will think of a long workflow to try it out though. How are you getting on with it since?

[-]

Plasmx@reddit

I could not get E4B to work as an agent so far. Maybe OpenCode and (oh my) Pi are not supporting the small models correctly, they might need different instructions where I did not dig into yet. Qwen3.5 9B for example does not show this „confirmation seeking bahaviour“, but it struggles with tool calls and gets the arguments wrong which leads to fails. That is really sad because I know those small models are really capable of small coding tasks in normal inference.

Gemma 4 26B is really slow with 16 GB VRAM and the large context of the agent harness. So not really fun to work with for me. In normal inference with smaller context speed is great.

[-]

LostDrengr@reddit

The default context size seems to be quite low (26B), either I am too near the vram limit or there needs to be some turboquant compression to get this to fly.

[-]

boutell@reddit

My outcome was similar I think. Sometimes I could get some code generated with claude code or open code at first, but eventually I would get into a loop of minimal responses or no responses.

[-]

LostDrengr@reddit

I think it was hitting a context wall too that I only noticed a little later.

[-]

Specter_Origin@reddit

I do believe a week worth of wait is a good idea for people who can't handle bugs but Qwen3.5 has been out for over a month and it still sucks in terms of loops and absurd thinking use. So sometimes it may be model and sometimes it might be bugs you just gotta wait and watch I guess.

[-]

FullstackSensei@reddit

Been using 397B at Q4 without any issues.

Did you make sure to follow the recommended parameters? Which quant are you using?

[-]

Specter_Origin@reddit

I did, directly from model card, but I have noticed people are having very different experience if they are serving it via llama.cpp or lmstudio or mlx etc. I did try Q4-6-8 gguf and MLX both via llama.cpp, mlx-vm & lmstudio.

[-]

FullstackSensei@reddit

I'm using vanilla llama.cpp with CUDA+CPU (three 3090s) and ROCm+CPU (three 32GB Mi50s).

Whose quants are you using? Did you check the unsloth documentation to see if you're setting the correct values?

[-]

ormandj@reddit

Did you try ik_llama with the 3x 3090 setup? That’s what I run and it was significantly faster than llamacpp

[-]

juandann@reddit

does ik_llama.cpp already support gemma4 in the main branch?

[-]

INtuitiveTJop@reddit

I put a timer for seven days before I even download and test anything

[-]

IrisColt@reddit

This.

[-]

FlamaVadim@reddit

much worse are people who say models are great before bugs fixed 😖

[-]

MaruluVR@reddit

You dont know what software they are using to run it or for what purpose they are using it so their claims might still be accurate.

[-]

Separate-Forever-447@reddit

That's why it would be more useful if people were more specific about what/how does and doesn't work. Generalizations aren't very helpful. "Works for me": not very useful.

[-]

Separate-Forever-447@reddit

Yeah. it is pretty frustrating. There are definitely harnesses and use cases that are still broken. Yes, right now, even after the round of latest fixes to the tokenizer, runtimes fixes to llama.cpp, and updates to front-ends.

OpenCode is still broken. Maybe its a problem with OpenCode; maybe lingering problems with gemma-4. Does anyone know?

So, the OP's comment "I also tried doing some stuff in OpenCode (it wasn’t even coding), and there were zero problems." is a bit weird.

I did some stuff in OpenCode... but it wasn't coding?

[-]

SlaveZelda@reddit

I know some people who use vllm with their fancy rig. Same kind of people who also never go below q8.

Anyways vllm allows you to run using transformers lib instead of vllm native so you can run most things with the official implementation without bugs.

[-]

Alternative_Elk_4077@reddit

I mean, it's very easy to test models on API that don't currently work locally. That's what I was doing while waiting for the lccp fixes and it seems like a legitimately strong model just from my shallow tinkering

[-]

DistrictAlarming@reddit

Yes indeed, I trust that and spend over 3hours encountered several issue and thought I'm the only one has issue since everyone say it's great, pretend they already use it for a while.

[-]

jacek2023@reddit (OP)

There are many "importers" here. They don't use models locally. They just hype benchmarks

[-]

ObsidianNix@reddit

Even then, benchmarks mean nothing for personal use case. If it works for you, it works for you. Doesn’t mean that it will work well for other peoples scenarios.

[-]

DinoAmino@reddit

Another recurring theme is that passing judgement on models for this only occurs with releases of Western models. The previous Qwen releases were never judged this way.

[-]

AlwaysLateToThaParty@reddit

Yeah, i thought I'd run them all too see, but there are obviously issues. I use llama.cpp. I'll give it a week, update the tools, and try again next week.

[-]

paolog89@reddit

All merged now!

[-]

karmakaze1@reddit

I got it all working with Linux on an AMD R9700 with ROCm 7.1.1 building from origin/master and some command-line tweaks. The only broken thing is that at the very end of output there's an extra <turn|> which should have been interpreted/consumed rather than output. (Something about a problem with parser interacting with other bits which could get fixed in the next days or week.)

[-]

beneath_steel_sky@reddit

More to come https://github.com/ggml-org/llama.cpp/issues/21434

[-]

jacek2023@reddit (OP)

Finally a good time to compile :)

[-]

Ok_Mammoth589@reddit

I just like hearing the fans spin up

[-]

StardockEngineer@reddit

I have the latest llama.cpp and both models fail on edits constantly. 26B worse than 31B. When they work it's great, tho. I've tried Unsloth UD Q6 and Bartowkski Q8

[-]

mnze_brngo_7325@reddit

31B is still failing with pydantic-ai tool calls or proper JSON output (which is the same with pydantic-ai). Getting `Input should be an object` validation errors.

It does work with very simple toy agent setups, but a more complex workflow, that works reliably with almost all LLMs I tested for the past months, fails every time.

Self-compiled llama.cpp (650bf1 commit from today) and the recent quants from unsloth and Bartowski. All have the same behavior.

[-]

jacek2023@reddit (OP)

Is there an issue for that?

[-]

mnze_brngo_7325@reddit

Not from me. It's hard to get a reproducible description of my setup to report.

[-]

jacek2023@reddit (OP)

Maybe you could find way to reproduce that, otherwise how could you expect a fix to appear

[-]

mnze_brngo_7325@reddit

Currently trying to bisect between the working toy example and the existing application to locate where it starts to fall appart.

[-]

ffedee7@reddit

have you encountered what’s the issue? it’s driving me crazy because the performance in google hosted Gemma 4 31B is WAY better than the local one, I tried a bunch of configs with the latest compiled llama.cpp but nothing…

[-]

Ambitious-Cod6424@reddit

I am following llama.cpp to deploy Gemma 4, all my models return unused24 error.

[-]

jacek2023@reddit (OP)

Why?

[-]

Ambitious-Cod6424@reddit

Not fixed yet.

What we have already checked and fixed

We have already ruled out many of the common implementation bugs on our side:

Prompt formatting
We stopped relying on ad hoc Go-side prompting for Gemma 4.
We restored structured messages_json.
We moved the bridge to llama.cpp's own chat-template pipeline (common_chat_templates_init, common_chat_templates_apply).
Thinking / reasoning mode
We explicitly disabled Gemma 4 hidden reasoning budget.
We added the Gemma 4 reasoning token workaround in the native bridge.
JSON / escaping issues
We fixed HTML escaping so -style tokens are not corrupted as \u003c....
Sampler pipeline
We replaced the old custom sampler path with the official common_sampler flow.
We restored top_k, top_p, temperature, and proper sampler state updates.
We added the missing sampler accept step.
Tokenization / decode bugs
We fixed the double- issue by stopping extra special-token insertion during tokenization.
We fixed the unstable token pointer usage in the decode loop.
We added filtering for visible output.
Output parsing
We switched final/streamed output to common_chat_parse instead of raw token text where possible.
GPU-offload workaround
We added the Gemma 4-specific n_gpu_layers = 29 workaround instead of full GPU offload.
Deployment/build issues
We fixed the native bridge build/link path issues.
We confirmed the rebuilt DLL is actually being loaded.
We added debug logging and verified runtime parameters in logs.

What the logs tell us now

The key finding is this:

The model is still generating as its first generated token.

That matters because it means:

the frontend is not inventing the bad output,
the stream renderer is not the root cause,
the prompt is reaching the model,
the bridge is running,
and the failure is happening at the actual model-generation stage.

So the issue is no longer "we forgot a stop token" or "we displayed the text wrong."

It is much deeper than that.

What is most likely still wrong

At this point, the most likely causes are:

Upstream llama.cpp Gemma 4 compatibility is still incomplete in our vendored version
This is the strongest hypothesis.
Gemma 4 support has been changing quickly upstream.
The exact behavior we see matches known Gemma 4 regressions reported by others.
The specific GGUF build may still be problematic with our current runtime
Some Gemma 4 GGUF variants, especially certain conversions/quantizations, are more likely to collapse into output.
Even if the model is not "broken," it may require newer tokenizer/template/runtime handling than our current vendored stack has.
GPU backend behavior may still be interacting badly with Gemma 4
We already mitigated full-offload regressions with gpu_layers=29.
But that may only reduce one failure mode, not fully solve the underlying incompatibility.

Not fixed yet.

[-]

OmarasaurusRex@reddit

The context requirements for the dense model appear to be huge? Not sure if a fix for that is in the works with llama.cpp

The moe model works great though

[-]

Durian881@reddit

It was already fixed for me (on LM Studio) several hours back.

[-]

RevolutionaryGold325@reddit

How much memory does 200k context eat?

[-]

Powerful_Evening5495@reddit

you need to update llama.cpp

it working great now

I am getting 60tokens in 4b model on rtx 3070

[-]

RevolutionaryGold325@reddit

How much memory does 200k context eat?

[-]

jacek2023@reddit (OP)

Not all fixes are merged (see the links), you will need to update later too :)

[-]

Powerful_Evening5495@reddit

i do it every few days , I build from source

[-]

psyclik@reddit

Out of curiosity, why compile instead of container or pre-built if you compile from main ?

[-]

FinBenton@reddit

Last time I tried pre-build ones, there just wasnt fitting ones available for 5090 with latest cuda toolkits and stuff, I dont remember what the issue was but building from source was the only real option.

Plus its really really easy, literally just git pull and the build commands, takes like a minute total and you always have the latest fixes and its actually build for your spesific hardware natively so there are cases where you just get a better performance.

[-]

psyclik@reddit

Oh, I know it’s easy. It’s just that compiling, building the container, redeploying the pod… it’s one extra step. But I got your point.

[-]

max123246@reddit

Why do you need a container? Just install the build tools and run

cmake -B build -DGGML_CUDA=ON

cmake --build build --config Release

[-]

max123246@reddit

They don't ship Linux with CUDA binaries for some odd reason.

[-]

AlwaysLateToThaParty@reddit

It's important to understand that compiling also allows you more control over the architecture you're using. If you have any non-standard hardware, you might need to modify compiler settings for your specific configuration to increase performance.

[-]

psyclik@reddit

I do understand that - experienced swe, not afraid of compiling and my rig has everything required. It’s just an extra step. The point about control seems moot, at least in my case : I don’t compile my kernel, I use packaged binaries, I run a couple of electron stuff, anything python or JS is a supply chain concern (and let’s not kid ourselves, if you dabble in AI you can’t avoid these stacks). And then everything gets deployed in k8s or docker which … well, I won’t compile it. And then there’s your browser. You might very well be more disciplined than I am, more power to you. But for me, I don’t see the point.

[-]

AlwaysLateToThaParty@reddit

If performance, maintenance, and security, aren't concerns for you, all good.

[-]

chickN00dle@reddit

for CUDA

[-]

Powerful_Evening5495@reddit

control

and know how

the repo is very active and when you downlaod new models , you can have alot of commits that dont merg with main fast enough

it fast and easy

[-]

AlwaysLateToThaParty@reddit

It's important to understand that compiling also allows you more control over the architecture you're using. If you have any non-standard hardware, you might need to modify compiler settings for your specific configuration to increase performance.

[-]

Uncle___Marty@reddit

Bro, it's been 8 minutes since we checked the repo. That's at least 63 new versions released .

[-]

Powerful_Evening5495@reddit

people make commits related to models

you can find them in the comments

use stable build if you dont like the fast changes

[-]

jacek2023@reddit (OP)

In my case, it’s just a habit. I’m a C++ developer, so running Git and CMake is not a big deal, sometimes I also build code from a PR to compare it, or I change something in the code myself

[-]

srigi@reddit

You want flip that numbers, like me - I’m updating few times a day. Luckily llama.cpp releases every few hours.

[-]

beneath_steel_sky@reddit

E.g. ngxson said he's going to add audio support in another PR https://github.com/ggml-org/llama.cpp/pull/21309#issuecomment-4180798163

[-]

MaruluVR@reddit

I wonder if it would be fast enough to use as STT for other LLMs as the amount of languages listed sound great

[-]

jacek2023@reddit (OP)

then there is some draft https://github.com/ggml-org/llama.cpp/pull/21421

[-]

Illustrious-Lake2603@reddit

I love this fix. Im getting 60+ tokens with the 26b model on my dual 3060 in Windows! Before it was running at 12-13tps.

[-]

ocarina24@reddit

Which quant do you use ? Q4_K_M ? Q3_K_S ? From Unsloth ?

[-]

Illustrious-Lake2603@reddit

Im using q4_k_m, from LM Studio. My only issue is that I have no idea to get the thinking enabled.

[-]

ocarina24@reddit

You have to create a model.yml by hand to get the Thinking toggle button.

[-]

Randomdotmath@reddit

me too, i accidentally trigger once, but it never happened again.

[-]

These-Dog6141@reddit

when can we expect a way to add vision support for llama.cpp similar to the fix that was availabe for gemma3 where like you load an additional transformer? the audio support seems to be being worked on (see pull request in OP) but what about vision? or is there already a similar way to get it working as before?

[-]

nickm_27@reddit

Vision was supported from the first commit

[-]

These-Dog6141@reddit

okay how to actiavte it llama-server

[-]

nickm_27@reddit

I just use -hf with a hugging face url it loads everything including vision

[-]

These-Dog6141@reddit

thanks will try

[-]

kelvie@reddit

Give it the mmproj file. Run the llama server help into a model if you need help setting it ip

[-]

These-Dog6141@reddit

ok thanks yes i used that mmproj file for gemma3 is it the same file still or a new one?

[-]

juandann@reddit

do they have a timeline for video?

[-]

Euphoric_Emotion5397@reddit

ya, it's really bad. What tool calling works in qwen 3.5 now all breaks.
So i went back to qwen 3.5. I'll wait another month or two to try it. Or maybe qwen 3.6 if it ever comes out open sourced

[-]

evilbarron2@reddit

I wonder how much of this bugginess with AI models and infrastructure is down to AI being used to write the code for AI models and infrastructure.

[-]

Double_Cause4609@reddit

I mean, there's almost certainly been at least one issue introduced by AI, but AI has also helped at least one person produce a good patch.

Honestly the bigger problem is just that there's so many minor tweaks to different model arches that it's hard to maintain a codebase that has all of them.

[-]

evilbarron2@reddit

So more a need for standardization than code quality?

[-]

Double_Cause4609@reddit

If all models just shared the same arch it would be simpler, yeah. I don't know if I'd say that's a "need", per se, though.

[-]

jacek2023@reddit (OP)

Probably it's not about just the bugs in the code, but about the fact that new models have different characteristics/exceptions

theoretically, there are rules against writing AI code in llama.cpp, but from what I see, there are more and more AI-generated PRs

[-]

zipzapbloop@reddit

yeah, no surprise to a lot of you here. it was llama.cpp (thanks u/jacek2023) and my faffing about trying to identify and fix bugs in the gguf were pretty much pointless in the end (except i learned some useful shit i guess). for anyone who cares here's my story this morning.

setup: win11, rtx pro 6000 96gb (blackwell), lm studio serving gemma-4-31b-it Q4_K_M to opencode and qwen code agent harnesses. comparing against qwen3.5-27b which has worked great for tool calling. gemma 4 would get stuck in infinite tool-call loops. completely unusable for agentic work despite google's benchmark claims.

tl;dr

the problem was (as others have already pointed out) lm studio's bundled llama.cpp lacking the gemma 4 specialized parser (PRs #21326, #21327, #21343, #21418). the gguf metadata does seem to have real issues too (missing eog_token_ids, wrong token types on tool-call delimiters), but the current llama.cpp runtime compensates for those automatically. so, woops. i'm clearly a novice here.

the fix: use llama.cpp b8664 or later with --jinja. that's it. grab the pre-built release from github, point it at the stock gguf, done. no gguf patching needed.

and, yeah, benchmarks aren't lying. gemma 4 genuinely is good at tool calling. but "good at tool calling" and "works in your local agent stack today" are different claims, and the gap between them was a handful of missing parser code in the runtime.

if you're on lm studio, sit tight until they update their bundled llama.cpp. or just run llama-server alongside it on a different port.

the whole story

step 1: the a/b curl tests (isolating the failure)

before touching anything, we wanted to prove where the failure actually was. ran identical curl tests against lm studio's openai-compatible endpoint for both models.

test 1 — single tool call (weather tool): both models passed. clean finish_reason: "tool_calls", valid json args. gemma was not broken at basic tool invocation.

test 2 — round trip (tool call → tool result → final answer): both models passed again. gemma accepted the tool result, gave a clean natural language answer, stopped properly.

test 3 — nested json schema (create_task with arrays, enums, nested objects): both passed. gemma handled the richer schema fine.

test 4 — multi-step two-tool chain (search_files → open_file): this is where gemma fell apart. lm studio logs started spamming:

Start to generate a tool call...
Model generated a tool call.
Start to generate a tool call...
Model generated a tool call.

over and over until ctrl-c. qwen completed the same test cleanly. so the failure was specifically in multi-step tool sequencing, not basic tool calling.

step 2: gguf metadata inspection (the red herring that taught me something)

vibed a raw binary parser (no dependencies) to inspect the gguf header. found a few possible problems:

one: tokenizer.ggml.eog_token_ids: completely missing. this is the list that tells llama.cpp when to stop generating. without it, the runtime only knows about EOS (token 1, <eos>). but in multi-step tool flows, <turn|> (token 106) also needs to be recognized as a generation stop point.

two: tool-call delimiter tokens typed wrong:

[48] <|tool_call> — USER_DEFINED (4) instead of CONTROL (3)
[49] <tool_call|> — USER_DEFINED (4) instead of CONTROL (3)
[50] <|tool_response> — USER_DEFINED (4) instead of CONTROL (3)
[51] <tool_response|> — USER_DEFINED (4) instead of CONTROL (3)

three: meanwhile <|tool> (46) and <tool|> (47) were correctly CONTROL. someone missed the inner four during conversion.

four: token 212 </s> typed as NORMAL (1) — this is the one lm studio warns about on load. it's actually an html tag in gemma's vocab (not the real eos), but lm studio gets confused because </s> traditionally means eos in other models.

vibed up a python script that patched the gguf: fixed the token types, added eog_token_ids = [1, 106], rewrote the header and copied \~18gb of tensor data. total size difference: 64 bytes.

result: womp womp. still looped in lm studio. the metadata seemed like real bugs but not the root cause of the looping. and maybe i'm just completely wrong about this.

in any case, this is where u/jacek2023's post pointing at the llama.cpp PRs became the key lead.

step 3: the actual fix — llama.cpp runtime

gemma 4 uses a non-standard tool-call format:

<|tool_call>call:function_name{key:value,key:value}<tool_call|>

with <|"|> for string quoting instead of standard json. every layer of the stack needed new code to handle it, and those fixes literally landed a couple days ago:

PR #21326 (apr 2) — gemma 4 template parser fixes, added normalize_gemma4_to_json() and a dedicated PEG parser
PR #21327 (apr 2) — tool call type detection for nullable/enum schemas
PR #21343 (apr 3) — tokenizer bug where \n\n gets split into two \n tokens, causing garbage in longer sessions
PR #21418 (apr 4) — gemma 4 specialized parser

as others have pointed out, lm studio bundles its own llama.cpp and hadn't pulled any of these yet.

grabbed the official pre-built release from github (b8664, released same day; windows binaries with cuda 13.1 for blackwell). no custom build needed, just a folder of exe + dll files.

launched with:

llama-server.exe ^
  --model gemma-4-31B-it-Q4_K_M.gguf ^
  --host 0.0.0.0 --port 8090 ^
  --n-gpu-layers 60 --ctx-size 262144 ^
  --threads 12 --batch-size 512 --parallel 4 ^
  --flash-attn on ^
  --cache-type-k q8_0 --cache-type-v q8_0 ^
  --mlock --jinja

the --jinja flag tells llama-server to use the model's own chat template instead of a hardcoded one, which i guess is required for gemma 4's non-standard tool format.

step 4: the payoff

re-ran the exact multi-step two-tool test on my patched gguf that caused infinite loops in lm studio:

step	expected	got
1. initial prompt	`search_files` call	`search_files`, `finish_reason: "tool_calls"` ✓
2. after search results	`open_file` call	`open_file` with correct path ✓
3. after file contents	natural language answer + stop	clean summary, `finish_reason: "stop"` ✓

no looping. no repeated tool-call generation. model even included coherent reasoning about which search result was the best match.

then pointed both opencode and qwen code at the llama.cpp endpoint. both are working beautifully now. multi-step tool chains, file reading, bash execution, the whole deal. gemma 4 even successfully adopted my custom agent persona, made jokes, and self-validated its own model by curling its own endpoint. all the stuff that was completely broken before.

step 5: controlled experiment — do the gguf patches even matter? nope lol

this bugged me. changed two things at once (gguf metadata + runtime) and didn't know which one was actually load-bearing. so loaded both the original unpatched gguf AND the patched gguf side by side on llama.cpp b8664 (different ports, same machine, 96gb vram makes this easy) and ran identical tests against both.

lm studio (old llama.cpp)	llama.cpp b8664
original gguf	infinite loop ✗
patched gguf	infinite loop ✗

the original unpatched gguf worked perfectly on b8664. identical behavior across all three steps. the runtime auto-infers the eog tokens and overrides the wrong token types on its own — you can see it in the load logs:

control-looking token: 212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden

good to know! i've not really understood these stacks at this level. boo on me.

so: you don't need to patch the gguf. the metadata issues might be real bugs in the file, but llama.cpp b8664 compensates for all of them at runtime. yay.

been testing some more complex agentic stuff in both opencode and qwen code and so far the model is killing it. i'm happy now. 🙌

[-]

idiotiesystemique@reddit

Does this impact people using ollama?

[-]

jacek2023@reddit (OP)

probably, check that post for details https://www.reddit.com/r/LocalLLaMA/comments/1qvq0xe/bashing_ollama_isnt_just_a_pleasure_its_a_duty/

[-]

idiotiesystemique@reddit

I don't care for the drama. I have a setup that works reliably that I use for actual work and I don't have time to fiddle changing it

[-]

jacek2023@reddit (OP)

But bugs in ollama might have been copied from llama.cpp, so it answers your previous question

[-]

Danny_Davitoe@reddit

Not always the case, Devstral 2 came out and llamacpp still can't parse the tool call tokens correctly. I am still waiting for a fix to be merged.

[-]

Specialist_Golf8133@reddit

gemma 4 getting proper llamacpp support is kinda huge tbh. feels like google's models always had weird quirks in the local stack but if this actually makes it smooth, that's a real option for people tired of meta's licensing nonsense. anyone tested it yet with longer contexts or does it still get weird past like 8k?

[-]

RedditUsr2@reddit

Your average person is just downloading LM Studio or whatever. They don't know or care about llama.cpp.

If the goal is to get people to like local LLMs then they need to work when people try them the first time.

[-]

jacek2023@reddit (OP)

The average person uses a web browser to chat with ChatGPT.

LM studio uses llama.cpp.

[-]

RedditUsr2@reddit

I mean the average person using Local at all. I think the goal should be to get more people to use local as well.

[-]

jacek2023@reddit (OP)

What's your point?

[-]

RedditUsr2@reddit

If this keeps happening and the average person cannot use local reliably then local AI is going to stay niche or become even more niche. You think corps are going to keep releasing local models forever to a shrinking niche community?

[-]

jacek2023@reddit (OP)

OK, but who are you addressing this complaint to? Google? authors of LM Studio? LocalLLaMA community?

[-]

RedditUsr2@reddit

The entire local llm community need to stop releasing the half baked buggy releases. It happens everywhere no matter if your using lm studio, ollama, or whatever. Its happened with every major release every time.

[-]

jacek2023@reddit (OP)

so explain to Google that Gemma 4 was released too early and they should wait a few weeks or months

[-]

RedditUsr2@reddit

Google didn't develop these fixes. Google doesn't control the release of ollama / lm studio / the rest.

The average person who tries local here's about a new model, trys it, it sucks, they go back to sammy.

we should try to do better or this will die as a hobby.

[-]

jacek2023@reddit (OP)

You can always request a refund.

[-]

RedditUsr2@reddit

I'll enjoy local llms while I can if we are going to just let it die.

[-]

jacek2023@reddit (OP)

Sora is dead

Meta’s celebrity AI bots are dead

Local AI is far from dead

[-]

RedditUsr2@reddit

Lets keep the trend going by making local more popular then.

[-]

jacek2023@reddit (OP)

by complaining?

[-]

RedditUsr2@reddit

Do you disagree on wanting local Ai to be more popular? or are you disagreeing that it needs to be easier to use to be more popular?

Pretending there is no issue never solved anything.

[-]

sgamer@reddit

these updates have been so fast that i've simply rebased my entire system around compiling llama.cpp directly and using llama-server webui and it's api, lol

[-]

Fortyseven@reddit

Had tools breaking pretty frequently in Opencode at first, but after updating llamacpp, works fine now. So far.

[-]

zipzapbloop@reddit

noticed the issues you're describing using lm studio + opencode. we did a pretty minimal repro on lm studio's openai-compatible endpoint with curl, using the same prompts/tools for qwen3.5-27b and gemma-4-31b-it@q4_k_m.

we found that both models handled the simple case fine. single tool call worked, both also handled the simple round-trip fine (tool call -> tool result -> final answer), both also handled a harder nested json tool schema fine.

so at first it looked like gemma was innocent, but then we tested a tiny multi-step agent flow with 2 tools: search_files, open_file

prompt was basically "find the file most likely related to lm studio tool-call failures, then open it."

qwen behaved normally. first call search_files, second call after fake search results open_file, no weirdness.

but sweat sweat gemma is where it got ugly. on the multi-step flow, lm studio logs started spamming start to generate a tool call... and model generated a tool call.

over and over and over until i came in with a ctrl-c hammer. so yeah, gemma + lm studio/llama.cpp def falls apart once the workflow becomes multi-step/agentic. bummer.

seems pretty consistent with what people in this thread are describing where toy setups seem to work, but more realistic agent/tool workflows break. and parser/template/runtime issues seem like the culprit. which, we've been through all this before.

also worth mentioning. i'm seeing lm studio logging some sketchy tokenizer/control-token stuff on gemma load (this is probably a bug in the model. its type will be overridden, the tokenizer config may be incorrect.

qwen3.5 serious is just way more stable for this use case right now. it's actually useful in the opencode harness. gemma 4 just isn't right now.

if useful i can post the exact curl, but the short version is basic function calling passed, multi-step tool sequencing is where gemma eats shit.

[-]

jacek2023@reddit (OP)

always try to post detailed description of your issue here https://github.com/ggml-org/llama.cpp/issues

but first you should try to reproduce that in llama.cpp server instead lm studio

[-]

zipzapbloop@reddit

but first you should try to reproduce that in llama.cpp server instead lm studio

will do