Gemma 4, llama.cpp, tool calls, and tool results - ChatGPT fixed it for me
Posted by TheProgrammer-231@reddit | LocalLLaMA | View on Reddit | 48 comments
I have been trying to use Gemma 4 for tool calling but, like a lot of people, kept getting errors.
I asked ChatGPT to help me figure it out. I gave it the chat template, it had me try a few different messages, and the tool calls kept breaking. The model could make a tool call but would not accept the result (the server either crashed with a 400/500 error or just made another tool call). ChatGPT suggested I look at the llama.cpp code and gave me a few things to search for, which I found in common/chat.cpp.
I had it review the code and come up with a fix. Based on the troubleshooting we had already done, it was able to figure out some things to try. The first few didn't fix it, so we added a bunch of logging. Eventually, we got it working though!
This is what ChatGPT had to say about the issues:
- Gemma 4's template/tool flow is different from the usual OpenAI-ish flow. The raw OpenAI-style assistant/tool history needs to be converted into Gemma-style tool_responses at the right point in the pipeline.
- In common_chat_templates_apply_jinja(), the Gemma tool-response conversion needed to happen earlier, before the generic prompt diff / generation-prompt derivation path.
- In common_chat_try_specialized_template(), that same Gemma conversion should not run a second time.
- In workaround::gemma4_model_turn_builder::build(), the synthesized assistant message needed an explicit empty content.
- Biggest actual crash bug: in workaround::gemma4_model_turn_builder::collect_result(), it was trying to parse arbitrary string tool output as JSON. That blows up on normal tool results like "[DIR] Components". Once I stopped auto-parsing arbitrary string tool output as JSON and just kept string results as strings, the Gemma continuation path started working.
About build() - it added that part based on what it saw in the chat template (the template needs empty content rather than no content key at all).
My test prompt was a continuation after tool call results were added (User->Assistant w/tool call->Tool result). The tool result happened to start with "[" (directory listing - "[DIR] Components") which tripped up some json parsing code. That is what it's talking about in collect_result() above.
I tested it a bit in my own program and it works! I tested Qwen3.5 and it still works too so it didn't break anything too badly.
It's 100% ChatGPT-generated code. llama.cpp probably doesn't want AI slop code (I hope so, anyway) but I still wanted to share it. Maybe it will inspire someone to do whatever is needed to update llama.cpp.
Here is the gemma4_fix.diff I created (from ChatGPT's code). I hope it helps somebody. Should I have posted the updated methods instead of a diff? BTW - this is my first ever Reddit post.
diff --git a/common/chat.cpp b/common/chat.cpp
index 5b93c5887..7fb3ea2de 100644
--- a/common/chat.cpp
+++ b/common/chat.cpp
@@ -1729,59 +1729,60 @@ struct gemma4_model_turn_builder {
         }
     }
 
-    void collect_result(const json & curr) {
-        json response;
-        if (curr.contains("content")) {
-            const auto & content = curr.at("content");
-            if (content.is_string()) {
-                // Try to parse the content as JSON; fall back to raw string
-                try {
-                    response = json::parse(content.get<std::string>());
-                } catch (...) {
-                    response = content;
-                }
-            } else {
-                response = content;
-            }
-        }
-
-        std::string name;
-
-        // Match name with corresponding tool call
-        size_t idx = tool_responses.size();
-        if (idx < tool_calls.size()) {
-            auto & tc = tool_calls[idx];
-            if (tc.contains("function")) {
-                name = tc.at("function").value("name", "");
-            }
-        }
-
-        // Fallback to the tool call id
-        if (name.empty()) {
-            name = curr.value("tool_call_id", "");
-        }
-
-        tool_responses.push_back({{"name", name}, {"response", response}});
-    }
-
-    json build() {
-        collect();
-
-        json msg = {
-            {"role", "assistant"},
-            {"tool_calls", tool_calls},
-        };
-        if (!tool_responses.empty()) {
-            msg["tool_responses"] = tool_responses;
-        }
-        if (!content.is_null()) {
-            msg["content"] = content;
-        }
-        if (!reasoning_content.is_null()) {
-            msg["reasoning_content"] = reasoning_content;
-        }
-        return msg;
-    }
+    void collect_result(const json & curr) {
+        json response;
+        if (curr.contains("content")) {
+            const auto & content = curr.at("content");
+            if (content.is_string()) {
+                // Keep raw string tool output as-is. Arbitrary tool text is not
+                // necessarily valid JSON.
+                response = content.get<std::string>();
+            } else {
+                response = content;
+            }
+        }
+
+        std::string name;
+
+        // Match name with corresponding tool call
+        size_t idx = tool_responses.size();
+        if (idx < tool_calls.size()) {
+            auto & tc = tool_calls[idx];
+            if (tc.contains("function")) {
+                const auto & fn = tc.at("function");
+                if (fn.contains("name") && fn.at("name").is_string()) {
+                    name = fn.at("name").get<std::string>();
+                }
+            }
+        }
+
+        // Fallback to the tool call id
+        if (name.empty()) {
+            name = curr.value("tool_call_id", "");
+        }
+
+        tool_responses.push_back({{"name", name}, {"response", response}});
+    }
+
+    json build() {
+        collect();
+
+        json msg = {
+            {"role", "assistant"},
+            {"tool_calls", tool_calls},
+            {"content", ""},
+        };
+        if (!tool_responses.empty()) {
+            msg["tool_responses"] = tool_responses;
+        }
+        if (!content.is_null()) {
+            msg["content"] = content;
+        }
+        if (!reasoning_content.is_null()) {
+            msg["reasoning_content"] = reasoning_content;
+        }
+        return msg;
+    }
 
     static bool has_content(const json & msg) {
         if (!msg.contains("content") || msg.at("content").is_null()) {
@@ -1914,7 +1915,6 @@ std::optional<common_chat_params> common_chat_try_specialized_template(
     // Gemma4 format detection
     if (src.find("'<|tool_call>call:'") != std::string::npos) {
-        workaround::convert_tool_responses_gemma4(params.messages);
         return common_chat_params_init_gemma4(tmpl, params);
     }
@@ -1958,14 +1958,10 @@ static common_chat_params common_chat_templates_apply_jinja(const struct common_
         workaround::func_args_not_string(params.messages);
     }
 
-    params.add_generation_prompt = false;
-    std::string no_gen_prompt = common_chat_template_direct_apply_impl(tmpl, params);
-    params.add_generation_prompt = true;
-    std::string gen_prompt = common_chat_template_direct_apply_impl(tmpl, params);
-    auto diff = calculate_diff_split(no_gen_prompt, gen_prompt);
-    params.generation_prompt = diff.right;
-
-    params.add_generation_prompt = inputs.add_generation_prompt;
+    const bool is_gemma4 = src.find("'<|tool_call>call:'") != std::string::npos;
+    if (is_gemma4) {
+        workaround::convert_tool_responses_gemma4(params.messages);
+    }
 
     params.extra_context = common_chat_extra_context();
     for (auto el : inputs.chat_template_kwargs) {
@@ -2005,6 +2001,24 @@ static common_chat_params common_chat_templates_apply_jinja(const struct common_
         return data;
     }
 
+    if (is_gemma4) {
+        params.add_generation_prompt = inputs.add_generation_prompt;
+        params.generation_prompt = "<|channel>thought\n<channel|>";
+
+        auto result = common_chat_params_init_gemma4(tmpl, params);
+        result.generation_prompt = params.generation_prompt;
+        return result;
+    }
+
+    params.add_generation_prompt = false;
+    std::string no_gen_prompt = common_chat_template_direct_apply_impl(tmpl, params);
+    params.add_generation_prompt = true;
+    std::string gen_prompt = common_chat_template_direct_apply_impl(tmpl, params);
+    auto diff = calculate_diff_split(no_gen_prompt, gen_prompt);
+    params.generation_prompt = diff.right;
+
+    params.add_generation_prompt = inputs.add_generation_prompt;
+
     if (auto result = common_chat_try_specialized_template(tmpl, src, params)) {
         result->generation_prompt = params.generation_prompt;
         return *result;
@@ -2187,4 +2201,3 @@ std::map<std::string, bool> common_chat_templates_get_caps(const common_chat_tem
     GGML_ASSERT(chat_templates->template_default != nullptr);
     return chat_templates->template_default->caps.to_map();
 }
-
superdariom@reddit
I found Gemma 4 buggy even after the specialist parser they added a couple of days ago but I haven't tested the code they've added yesterday. Qwen agreed to move back in with me and we just don't mention my disastrous fling with Gemma. I still think of her though.
rosco1502@reddit
This is exactly how I feel 😂
AnOnlineHandle@reddit
I think it might depend on the model and quant. I've tried a 26b it heretic quant which has been amazing in a version of LM Studio updated maybe a week or two ago, best writing model I've found after a long search. I tried a quant of the base 26b model however and it is terrible, looping the same outputs after a little time. The 31B model also seemed worse than the 26b model with occasional errors, though not completely broken.
I've been using the Q4_K_M checkpoint from nohurry/gemma-4-26B-A4B-it-heretic-GUFF, and I'd be curious to know whether it works for other people having issues. I made a post a few days ago about how it's the best writing model I've found, but it got downvoted and I got accused of shilling. I'm not the one who uploaded it; it's just genuinely the best writing model I've found, and I'd like others to know too and potentially start a finetuning ecosystem around it if it works for others.
superdariom@reddit
Yes I think your use case is different. I'm doing tool calling and technical agent based work
AnOnlineHandle@reddit
Yeah it 100% might come down to use case, though in this particular case I noticed that not only was that particular checkpoint good, it was also the only one that seemed stable in my recent'ish LM Studio version, so am curious if the stability issue is checkpoint-based. I assume most people are using quants, and it's possible that many of them are messing something up.
SearchTricky7875@reddit
For vLLM, use --enable-auto-tool-choice --tool-call-parser gemma4
RipperFox@reddit
Did you already file an issue at llama.cpp's github? They include templates, so they'll have to update, too!
insanemal@reddit
Did you raise a big with llama.cpp?
TheProgrammer-231@reddit (OP)
No, I did not. Maybe I should? I certainly don't want to have to manually apply the patch every time llama.cpp (or at least common/chat.cpp) is updated. ChatGPT modified the code, I didn't think they'd want the AI generated code. I suppose a bug report doesn't have to have code attached to it. "big" is a typo for "bug", right?
insanemal@reddit
Yes and yes.
If you find a bug raise a report. (yeah android autocorrected it for me)
Mention you vibed a fix, and provide a link to your fix, if they want it they will tell you.
The vibecoded fix might be perfect, it might not but it also might help them see a possible solution which can help.
TheProgrammer-231@reddit (OP)
I did create an issue on github for it now that I know more about the problem.
EbbNorth7735@reddit
And create a PR while you're at it, to link the bug to.
pfn0@reddit
Was the build you were running very recent? E.g. https://github.com/ggml-org/llama.cpp/pull/21418 went in 3 days ago, and there were probably more fixes since then (PR search lists quite a few)
TheProgrammer-231@reddit (OP)
From a few/several hours ago, I’ve been updating frequently waiting for a fix.
pfn0@reddit
Also, what are your repro steps? Even on a version before that PR merged, I haven't really encountered issues with toolcalling. Admittedly, though, I've barely used gemma4, other than a few contrived tasks with toolcalls.
TheProgrammer-231@reddit (OP)
Well, let me recompile the official version. OK, it's now at version: 8714 (3ba12fed0).
To test, I just used some powershell commands.
First, setup the body with messages: user, assistant (tool call), tool (tool response)
And then submit it and look at results:
And llama-server shows me:
If I apply my patch, manually this time since chat.cpp has changed, and try again then it works (200 response).
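For reference, a request body with that message shape looks roughly like this in the OpenAI-compatible format llama-server accepts (the model and tool names here are illustrative, not the OP's actual payload); note the tool result string starting with "[":

```json
{
  "model": "gemma-4",
  "messages": [
    { "role": "user", "content": "List the files in the project root." },
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "id": "call_1",
          "type": "function",
          "function": { "name": "list_dir", "arguments": "{\"path\": \".\"}" }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "call_1",
      "content": "[DIR] Components\n[DIR] Services\nProgram.cs"
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "list_dir",
        "description": "List the contents of a directory",
        "parameters": {
          "type": "object",
          "properties": { "path": { "type": "string" } },
          "required": ["path"]
        }
      }
    }
  ]
}
```

Submitting something shaped like that to /v1/chat/completions is what exercises the collect_result() path discussed above.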
pfn0@reddit
super weird, I can't repro your crash.
I'm running an older version of llama.cpp from before that gemma4 specific parser even got merged--maybe that's the bug?
TheProgrammer-231@reddit (OP)
I reverted back to the original chat.cpp and then updated to latest version.
version: 8719 (2dcb7f74e)
From WSL (linux inside of windows) I ran your curl cmd (model name was changed, that's it) and llama-server (running on Windows still) threw a 500 error (as I expected).
So then I went to apply my patch and thought, that json::parse command was the last change I made before it worked... maybe I should start with that. So I changed code in gemma4_model_turn_builder::collect_result from this (chat.cpp lines 1737 - 1742):
To:
BTW - I had another line between auto s and response for debugging:
LOG_ERR("gemma4 collect_result: content string len=%zu\n", s.size());
With that change ONLY, it worked!
My own program works great with it now too. Agentic kind of thing - I told it to read a file, write the biggest issue to another file, read that file to verify, write another file with possible solutions, read and verify that file - all in one prompt. It did each step, calling tools as needed. It's working great now with Gemma 4.
Also, thank you for taking the time to help.
So, now the question is... why is that throwing a 500 error when the original code is in a try/catch block? Shouldn't the catch block, you know, catch the exception? And, I wonder if the original code works when the result is valid json? And, is the fact that it starts with something that MIGHT be valid json (the '[' in '[DIR]') part of the issue? And, what is the consequence of not parsing it as json if it is json? Hmm.
At least it's a much smaller patch now, if nothing else. ChatGPT and I tried a lot of stuff before we got to that, I guess none of the prior steps were needed.
pfn0@reddit
that's so weird. I wonder if it's a compiler optimization error that causes it to mess up the try/catch
I build on nvidia/cuda:13.1.0-devel-ubuntu24.04 with these as my cmake setup
TheProgrammer-231@reddit (OP)
So... it was my build bat file all this time. Can't have multiple -DCMAKE_CXX_FLAGS so the /EHsc was not getting applied. Needed to be on one line like this:
-DCMAKE_CXX_FLAGS="/EHsc /wd4267 /wd4244 /wd4305 /wd4996" \^
And now it works without any changes to the source.
/EHsc is for exception unwinding/catching... json::parse aborts instead of throwing an exception if the flag isn't set.
Ugh, well... at least it works now. Thanks again for the help.
pfn0@reddit
yay, we figured it out. that also went right by me that the multiple duplicate CMAKE_CXX_FLAGS were there (because last one wins and /EHsc wouldn't take effect) -- I completely missed it, too.
aldegr@reddit
I had a feeling this was it, which is why I asked about your build type! I’m glad you figured it out.
TheProgrammer-231@reddit (OP)
I have a .bat file I run from a Developer Command Prompt (visual studio 2022) in Terminal. I think I might need to take a close look at your flags. I might want a few of them. I am using CUDA 13.2. I can't imagine any of that would cause json::parse to fail though.
I posted an issue on github, maybe somebody can reproduce it and fix it. I hope it isn't a me-only issue! I might have to be the one to figure out why json::parse is crashing hard enough to skip the catch block.
zzzUpdate-ninja.bat
pfn0@reddit
is /EHsc a standard llama.cpp build flag? that does change exception handling.
TheProgrammer-231@reddit (OP)
Just saw this - yeah, that was the issue.
TheProgrammer-231@reddit (OP)
Huh, that is weird. Could possibly be a Linux vs Windows thing too. Qwen3.5, gpt-oss, and every other model I've tried works for me, but Gemma never did (tool results specifically; it'd make the call fine). I'm on a 5090, which should be similar enough to your 6000 Pro.
pfn0@reddit
maybe it's a dependency thing; windows libraries vs. linux. even on windows, I run llama.cpp in a docker, which makes it linux as well, so my build would be consistent from platform to platform
pfn0@reddit
Thanks for sharing. I tried converting your payload to json, and it looks malformed:
tools is not valid.
pfn0@reddit
What is your build number? you can correlate it to whether that PR is in the build you're running. llama-cli --version should say
TheProgrammer-231@reddit (OP)
version: 8702 (c5ce4bc22). Which https://github.com/ggml-org/llama.cpp/releases says was released 9 hours ago.
TheProgrammer-231@reddit (OP)
I just looked at that link. Seems like it should have been fixed then? But mine was broken still.
LeHiepDuy@reddit
Yours seems to be on par with my experience with tool calling in Gemma 4. While it answers blazing fast, almost all tool calls fail in some way or another. Despite updating to the latest llama.cpp v2.12.0, the problem still persists.
aldegr@reddit
Which platform are you building on, and which build type? Debug? Release?
TheProgrammer-231@reddit (OP)
Windows, Release.
aldegr@reddit
I'm really curious what your original errors were, because the `catch (...)` should fall back to a string if it cannot parse as JSON.
TheProgrammer-231@reddit (OP)
It was my own stupid fault... see the update. Thanks for commenting though.
TheProgrammer-231@reddit (OP)
IDK if you saw the update or the conversation with pfn0 but the only issue seems to be the json::parse call and the catch block does NOT catch it.
TheProgrammer-231@reddit (OP)
I just tried it without that part and llama-server gave me a 500 error when I submitted a request. I did not look any deeper into it though.
TheProgrammer-231@reddit (OP)
Yeah, I’m not convinced that part is necessary. I thought the same as you - catch should’ve gotten it. I’d have to review my ChatGPT session to tell you how it ended up in there.
sunychoudhary@reddit
Nice to see tool calls getting smoother in local setups.
The real test will be how stable it is over longer chains: does it keep the right tool context, does it recover cleanly from bad outputs, and how deterministic are the calls.
Tool calling looks great in demos, but reliability is what makes it usable.
CommonPurpose1969@reddit
Does anyone else have at the beginning of the response content with E2B and E4B Q8?
KokaOP@reddit
did anyone get the audio working on GPU in small gemma-4 models ??
Thomasedv@reddit
What issues did you have with gemma4?
I use the Q4 MoE variant.
My biggest issue, when I used Claude Code with it, is that some tool calls continually fail, like file edits failing because it can't find the string to replace.
The other issue is a bit worse: lots of looping, either with tools or with "I'll do X", which it then just repeats forever. That's a bit sad, because it's a surprisingly fast model for coding when it doesn't hit those issues.
ambient_temp_xeno@reddit
q4km of the 26b moe is a lot worse than 31b.
TheProgrammer-231@reddit (OP)
I could chat with it fine until it made a tool call. Adding the tool results would then crash it. I was using 31B. I did see that looping issue when I was trying different things.
jacek2023@reddit
llama.cpp github may be a better place to discuss changes in the source code :)
SM8085@reddit
True, they don't like LLM additions for some reason. Personally, I think their AGENTS.md is counter-productive: https://github.com/ggml-org/llama.cpp/blob/master/AGENTS.md
I think what they/anybody really need to do is teach people about reasonable Pull Requests. When done properly, it shouldn't matter if it's human or LLM generated IMO. Either the code works or doesn't. Either it's clean enough to pull or isn't.
pfn0@reddit
"for some reason" ... it is a very fine stance to take, flooding the project with vibe coded PRs will not give enough time to properly review and vet all changes.