Local AI video pipeline review: Qwen3 27B beat Gemma 4 26B for tool calling
Posted by Practical_Low29@reddit | LocalLLaMA | 14 comments
Watched All About AI's 100% local Fireship-style video automation experiment over the weekend (link in comments). A few things worth flagging if you're trying the same stack.
Tool calling reliability was where the two diverged. Gemma 4 26B kept getting stuck in tool-call loops on his rig. Qwen 3.6 27B handled the same orchestration cleanly, no wasted thinking tokens. That gap is bigger than benchmark numbers suggest once you push real agent workflows through it.
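If you're reproducing the stack and hit the same looping failure mode, the cheap guard is to fingerprint recent tool calls and bail when they repeat. Rough sketch of the idea, not anything from the video; `history` is whatever your agent loop already records:

```python
import json

def detect_tool_loop(history, window=3):
    """Return True if the last `window` tool calls are identical
    (same tool name, same arguments), i.e. the model is looping."""
    if len(history) < window:
        return False
    # Serialize name+args so dict key ordering can't cause false negatives
    keys = [json.dumps({"name": c["name"], "args": c["args"]}, sort_keys=True)
            for c in history[-window:]]
    return len(set(keys)) == 1

# The failure mode described above: same call, three times in a row
history = [{"name": "read_file", "args": {"path": "scene.md"}}] * 3
assert detect_tool_loop(history)  # bail out / reprompt instead of burning tokens
```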
For images he ran Said Image Turbo locally off Hugging Face. Open weights, no API spend. Solid for meme-style cards. Portrait shots are where you'd probably reach for a Flux or Seedream call instead.
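For anyone who wants the local image step without watching the video, the shape of it is just a turbo-distilled open-weights checkpoint through diffusers. I'm honestly not sure of the exact model name he used (see the question downthread), so sdxl-turbo stands in here as a known-real placeholder:

```python
# Local open-weights text-to-image, no API spend. Swap the checkpoint id
# for whatever turbo model the video actually used.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16
).to("cuda")

# Turbo-distilled models run in 1-4 steps with guidance disabled
image = pipe("meme-style card, bold caption, flat colors",
             num_inference_steps=1, guidance_scale=0.0).images[0]
image.save("card.png")
```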
Orchestration was OpenCode end-to-end. Context window climbed to 174K tokens and the to-do list wasn't fully completed in one shot. He stepped away from the rig mid-run and came back to a partial result, which is honestly the realistic version of "AI did the work for me".
For people not wanting to run a 27B model locally, the Qwen3 family is available from a few inference providers, so the API path gets you the same weights without the GPU outlay up front. Tool-call behavior should hold since the weights are the same.
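The API path is a one-line swap if your loop already speaks the OpenAI schema. Provider base_url and model id below are placeholders; check your provider's model list:

```python
from openai import OpenAI

# Placeholder endpoint -- any OpenAI-compatible Qwen3 provider works the same way
client = OpenAI(base_url="https://your-provider.example/v1", api_key="sk-...")

tools = [{
    "type": "function",
    "function": {
        "name": "write_scene",
        "description": "Write one scene of the video script",
        "parameters": {
            "type": "object",
            "properties": {"topic": {"type": "string"}},
            "required": ["topic"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-27B",  # placeholder id, varies by provider
    messages=[{"role": "user", "content": "Draft scene 1 about local pipelines"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```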
If you've benchmarked Qwen3's tool-calling failure rate against DeepSeek V4 on a specific stack (open-claw, Aider, a custom loop), I'd love to see the actual numbers.
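To be concrete about what I mean by failure rate: something as dumb as this per-stack harness would do. `run_agent` and both flags are placeholders for whatever your stack can actually report:

```python
def tool_call_failure_rate(run_agent, prompt, n=50):
    """run_agent(prompt) must return {'looped': bool, 'malformed': bool};
    both flags are stand-ins for whatever your agent loop can detect."""
    failures = 0
    for _ in range(n):
        result = run_agent(prompt)
        if result["looped"] or result["malformed"]:
            failures += 1
    return failures / n

# e.g. tool_call_failure_rate(my_qwen_loop, "build the scene list", n=50)
```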
seamonn@reddit
You need to compare Qwen 3.6:27b to Gemma 4:31b. Both excel at tool calling.
Last_Mastod0n@reddit
In my experience, Qwen has been better at coding and vision than Gemma when comparing their dense models, but Qwen also tends to hallucinate more. Gemma is more concise (in a good way) and faster at token generation. I've also heard that Gemma is leagues better than Qwen at creative writing.
The two trade blows in different areas, so I've started looking into routing between both models dynamically in my pipeline, as sketched below. But I'm not sure the juice is worth the squeeze yet.
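To make the routing idea concrete, the naive version is just a keyword gate in front of two endpoints. Model tags are placeholders, and a real pipeline would want a small classifier instead:

```python
CODING_HINTS = ("def ", "class ", "traceback", "refactor", "unit test")

def pick_model(task: str) -> str:
    """Crude keyword router: Qwen for agentic coding, Gemma for everything else."""
    lowered = task.lower()
    if any(hint in lowered for hint in CODING_HINTS):
        return "qwen3.6:27b"   # placeholder tag
    return "gemma4:26b"        # placeholder tag

print(pick_model("Refactor the scene renderer"))  # qwen3.6:27b
print(pick_model("Write a witty video caption"))  # gemma4:26b
```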
GrungeWerX@reddit
I’ve had the opposite experience, to the point I’ve stopped using Gemma because it’s too slow, virtually unusable at long context. I’m waiting for it to be optimized…
seamonn@reddit
that's a hardware issue.
GrungeWerX@reddit
Qwen runs faster at a higher parameter count. Sounds like a software issue to me.
seamonn@reddit
Pretty much. We use Qwen for agentic coding and Gemma for agentic non-coding.
Also, Gemma's vision is better than or at least equal to Qwen's if configured right. I wrote a post about this.
ambient_temp_xeno@reddit
Another post comparing dense Qwen to the Gemma 4 MoE. This is very sus.
Practical_Low29@reddit (OP)
Source video, in case anyone wants to watch the full run: https://www.youtube.com/watch?v=ydUBYFlwhyk
Toastti@reddit
You are essentially comparing a 27B-parameter model to a 4B one. One is dense, the other is a MoE with ~4B active parameters.
Hot-Employ-3399@reddit
Not even close. A 4B model takes ~2-4 GB of VRAM (4B parameters at roughly 4-8 bits per weight).
Comparing something that fully fits in VRAM against something that has to fully fit in RAM makes total sense.
"It's just a MoE" is no excuse as long as the model doesn't fit on the GPU.
hidden2u@reddit
"Said Image Turbo", which model is this?
ttkciar@reddit
It seems odd to compare a 27B dense model to a 26B-A4B, but okay.
A pity they didn't compare Qwen3.6-27B to Gemma-4-31B-it (both dense models).
Practical_Low29@reddit (OP)
Fair point on dense vs A4B; that's on the video creator, not me. He was downloading what was available locally when he tried it. Agreed that the apples-to-apples comparison would be Qwen3.6-27B dense vs Gemma 3 27B dense or Gemma 4 31B-it. For what it's worth, my reading of the run wasn't "27B beats 26B on benchmarks", it was "this specific model handled tool calling without looping on his rig, that one didn't". It's a deployment finding more than an architecture comparison. But yeah, the headline framing oversimplified that.
illforgetsoonenough@reddit
This is apples to oranges, just a bad comparison. And the video creator didn't share the video here, you did, so don't try to deflect.