Interesting-Sock3940

Replaced Claude with local Qwen3.6-27B in my multi-agent orchestrator for 2 weeks

Posted by Interesting-Sock3940@reddit | LocalLLaMA | View on Reddit | 149 comments

Interesting-Sock3940@reddit (OP)

Ran it all through my own orchestrator (https://openyabby.com). The system prompt plus the JSON schemas for the tool definitions ate about 2.5k tokens per turn. It's a heavy tax on a 32k context window, but strictly necessary if you want to gate the execution layer.
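For a sense of scale, here is a rough sketch of that overhead as a share of the window. It assumes "32k" means 32,000 tokens (it could equally be 32,768) and treats the 2.5k figure as a fixed slice that is present on every turn; the function names are made up for illustration.

```python
# Back-of-the-envelope context budget using the figures quoted in the comment.
CONTEXT_WINDOW = 32_000     # assumption: "32k" read as 32,000 tokens
OVERHEAD_PER_TURN = 2_500   # system prompt + tool JSON schemas ("about 2.5k")


def overhead_fraction(context_window: int, overhead: int) -> float:
    """Share of the context window consumed by the fixed prompt overhead."""
    return overhead / context_window


def usable_tokens(context_window: int, overhead: int) -> int:
    """Tokens left for conversation history, tool results, and output."""
    return context_window - overhead


if __name__ == "__main__":
    frac = overhead_fraction(CONTEXT_WINDOW, OVERHEAD_PER_TURN)
    left = usable_tokens(CONTEXT_WINDOW, OVERHEAD_PER_TURN)
    print(f"overhead: {frac:.1%} of the window, {left} tokens left")
```

Under those assumptions the overhead is a bit under 8% of the window before any history or tool output is added.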

Interesting-Sock3940@reddit (OP)

i'm going to manually patch the Jinja file this weekend and re-run the benchmark. if that fixes the 12% error rate, it changes the math entirely

Interesting-Sock3940@reddit (OP)

ok will try it

Entire world: We need more GPUs. Meanwhile, Jensen Huang:

Posted by Nunki08@reddit | LocalLLaMA | View on Reddit | 264 comments

Interesting-Sock3940@reddit

AI can do better than him

Is Qwen3.6 current king for local agentic use?

Posted by HornyGooner4402@reddit | LocalLLaMA | View on Reddit | 150 comments

Interesting-Sock3940@reddit

most local models look good until you actually let them run agents for more than 5 minutes

Ran K2.6 through a third-party coding benchmark: here's how the figures stand up

Posted by lucasbennett_1@reddit | LocalLLaMA | View on Reddit | 5 comments

Interesting-Sock3940@reddit

Interesting result. Did K2.6 actually solve more of the coding task or did it mainly score higher because the tool calling and local runtime were more stable?

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints

Posted by ex-arman68@reddit | LocalLLaMA | View on Reddit | 380 comments

Interesting-Sock3940@reddit

This is the kind of local inference progress that actually matters because it turns Qwen 3.6 27B into something closer to a practical coding agent on consumer hardware. The key test is whether MTP only makes it faster or whether it also makes small coding mistakes more likely, because local agents usually fail when one wrong assumption gets carried through a whole repo task

New "major breakthrough?" architecture SubQ

Posted by Daemontatox@reddit | LocalLLaMA | View on Reddit | 37 comments

Interesting-Sock3940@reddit

is there any way to test it yet? the claims are huge and i’d love to be wrong, but without something reproducible it’s hard to know what to make of it

DeepSeek V4 Pro matches GPT-5.2 on FoodTruck Bench, our agentic benchmark — 10 weeks later, ~17× cheaper

Posted by Disastrous_Theme5906@reddit | LocalLLaMA | View on Reddit | 94 comments

Interesting-Sock3940@reddit

love that we are now measuring the China-US AI gap by how efficiently a model can sell tacos, and apparently the answer is 10 weeks behind and 17x cheaper lol

The Ultimate LLM Fine-Tuning Guide

Posted by PromptInjection_@reddit | LocalLLaMA | View on Reddit | 8 comments

Interesting-Sock3940@reddit

solid intro. one gap that's easy to miss when you're learning SFT: a fine-tuned model can ace held-out eval and still lose tool-call format compliance the moment you load it into a real serving harness. the guide ends right where the actual production headache starts. worth a small chapter on validating against the deployment loop, not just GGUF conversion
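a minimal sketch of what a format-compliance check could look like, assuming the serving harness hands you each tool call as a raw JSON string with a `name` and an `arguments` object. that shape, the tool names, and the function names are all hypothetical; real harnesses and chat templates differ.

```python
import json


def is_valid_tool_call(raw: str, allowed_tools: set[str]) -> bool:
    """True if `raw` parses as JSON shaped like {"name": str, "arguments": dict}
    and the name is one of the allowed tools (hypothetical format)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    name = call.get("name")
    args = call.get("arguments")
    return isinstance(name, str) and name in allowed_tools and isinstance(args, dict)


def compliance_rate(outputs: list[str], allowed_tools: set[str]) -> float:
    """Fraction of raw outputs that pass the format check (0.0 for no outputs)."""
    if not outputs:
        return 0.0
    passed = sum(is_valid_tool_call(o, allowed_tools) for o in outputs)
    return passed / len(outputs)


if __name__ == "__main__":
    tools = {"read_file", "run_tests"}  # made-up tool names
    samples = [
        '{"name": "read_file", "arguments": {"path": "a.txt"}}',  # well-formed
        "read_file(a.txt)",                                       # not JSON
        '{"name": "rm_rf", "arguments": {}}',                     # unknown tool
    ]
    print(f"compliance: {compliance_rate(samples, tools):.0%}")
```

the point is to run this over outputs captured from the actual deployment loop, since a held-out eval scored on answer text alone would not catch a call that fails to parse.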

"Second Thoughts" Been playing with adding a small transformer that reads output near the end of generation, and feeds it back near the top as a refinement loop. A quick test of 1.7B model showed drastic improvement in focused tasks (like coding)

Posted by bigattichouse@reddit | LocalLLaMA | View on Reddit | 39 comments

Interesting-Sock3940@reddit

"out of scope for third thoughts" is the spot where it gets interesting tbh. once a model reviews its own output more than twice, it stops fixing real bugs and starts inventing new ones to justify another pass. how often does your gate fire on inputs that were already fine?