Never thought I'd say this, but Qwen3.7-Max is now running our 2,000-tester office agent. Better than Opus 4.8.

Posted by Least-Orange8487@reddit | LocalLLaMA

Note: I also made the attached marketing video using Qwen3.7-Max. My poor design skills aside, it's very impressive how good it is.

Anyway, I built PocketBot, an iOS personal Chief of Staff with a to-do-list interface. It has ~2,000 TestFlight users now, and we just moved the main LLM behind the harness to Qwen3.7-Max.

The product surface is simple:

- read work context across all the integrations

- find the actual follow-ups

- turn them into approve/reject actions

- draft the email/doc/message/deck if needed

- never send or change anything without user approval
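
To make the last bullet concrete, here is a minimal sketch of an approval gate in Python. This is not PocketBot's actual code: `ProposedAction`, `Status`, `ApprovalRequired` and `execute` are hypothetical names for illustration, and `send` stands in for whatever integration call would perform the side effect.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ProposedAction:
    # Hypothetical shape of one to-do item the agent proposes.
    title: str
    draft: str
    status: Status = Status.PENDING


class ApprovalRequired(Exception):
    """Raised when something tries to execute an unapproved action."""


def execute(action: ProposedAction, send) -> None:
    # The side effect (send) is only reachable for explicitly approved actions.
    if action.status is not Status.APPROVED:
        raise ApprovalRequired(f"{action.title!r} is {action.status.value}")
    send(action.draft)
```

The idea is that the model can only ever produce `ProposedAction` objects, and the only path to a side effect runs through `execute`, so a pending or rejected item cannot be sent by accident.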

Qwen3.7-Max is annoyingly good at this. Seriously, I've been spending $400/mo on both Codex and Claude subscriptions (admittedly, it's an obsession now), and this genuinely made me question those subscriptions.

Not because it “chats” better, but because messy office work is a long-context tool-use problem: stale emails, meeting notes, calendar constraints, Slack threads, Notion pages, Drive docs, half-finished drafts, and then the user asking:

“what am I about to miss?”

But Qwen3.7-Max is cloud-only. So now I want to answer the question this sub actually cares about:

How much of this agent can be moved to local/open-weight models before it becomes unreliable?

My current split idea:

- Qwen3.7-Max: hard planning, long-context synthesis, final drafting

- local Qwen: task extraction, classification, privacy-sensitive preprocessing, memory updates

- maybe small local model: routing / “is this even actionable?”
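
Roughly, the split above could be expressed as a routing table. This is only a sketch in Python: the task labels and the `Tier` and `route` names are made up for illustration, and sending unknown task types to the cloud model is an assumption, not something I have settled on.

```python
from enum import Enum


class Tier(Enum):
    SMALL_LOCAL = "small-local"  # routing / "is this even actionable?"
    LOCAL_QWEN = "local-qwen"    # extraction, classification, preprocessing, memory
    CLOUD_MAX = "qwen3.7-max"    # hard planning, long-context synthesis, final drafting


# Hypothetical task labels mapped to tiers according to the split idea.
ROUTES = {
    "is_actionable": Tier.SMALL_LOCAL,
    "extract_tasks": Tier.LOCAL_QWEN,
    "classify": Tier.LOCAL_QWEN,
    "privacy_preprocess": Tier.LOCAL_QWEN,
    "update_memory": Tier.LOCAL_QWEN,
    "plan": Tier.CLOUD_MAX,
    "synthesize_long_context": Tier.CLOUD_MAX,
    "final_draft": Tier.CLOUD_MAX,
}


def route(task: str) -> Tier:
    # Assumption: anything not listed falls back to the strongest model.
    return ROUTES.get(task, Tier.CLOUD_MAX)
```

Keeping the mapping in one table makes it easy to move a single task type from cloud to local and rerun the same evaluation.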

Models I'm planning to test are as follows:

- Qwen3.7-Max as the cloud baseline

- Qwen3.6-27B local

- Qwen3.6-35B-A3B local

- maybe Gemma 4 / GLM / DeepSeek for comparison

For people here running local agents seriously: What would you test first?

I care much less about “can it produce a pretty demo?” and more about whether a local model can be trusted to extract actions from private work context without inventing work that does not exist.
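
One cheap guard against invented work is a grounding check: require every extracted action to carry a verbatim evidence quote, and drop the action if that quote does not appear in the source context. The sketch below assumes a hypothetical action format (a dict with an `evidence` field); it only catches fabricated quotes, not wrong interpretations of real ones.

```python
import re


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so line wraps do not break matching.
    return re.sub(r"\s+", " ", text).strip().lower()


def is_grounded(evidence_quote: str, source_texts: list[str]) -> bool:
    # An action is grounded only if its quote appears in at least one source.
    quote = _normalize(evidence_quote)
    if not quote:
        return False
    return any(quote in _normalize(src) for src in source_texts)


def filter_grounded(actions: list[dict], source_texts: list[str]):
    # Split actions into (kept, rejected) based on the evidence check.
    kept, rejected = [], []
    for action in actions:
        ok = is_grounded(action.get("evidence", ""), source_texts)
        (kept if ok else rejected).append(action)
    return kept, rejected
```

The rejected list is worth logging per model, since it gives a rough count of unsupported actions without any hand labeling.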

If there is interest, I’ll post the results with:

- model / quant

- runtime

- hardware

- context size

- tool-call errors

- hallucinated actions

- missed actions

- latency

- cost

- real failure examples

- etc, etc, etc
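
For the hallucinated and missed action counts, a minimal scoring sketch might look like the following. It assumes each test case has a hand-labeled gold set of action IDs and that predicted actions have already been matched to those IDs; the matching step is the hard part and is not shown. `score_actions` is a hypothetical helper name.

```python
def score_actions(predicted: set[str], gold: set[str]) -> dict:
    # Hallucinated: proposed by the model but absent from the gold labels.
    hallucinated = predicted - gold
    # Missed: in the gold labels but never proposed by the model.
    missed = gold - predicted
    correct = predicted & gold
    # Convention (an assumption): empty sets score 1.0 rather than dividing by zero.
    precision = len(correct) / len(predicted) if predicted else 1.0
    recall = len(correct) / len(gold) if gold else 1.0
    return {
        "hallucinated": sorted(hallucinated),
        "missed": sorted(missed),
        "precision": precision,
        "recall": recall,
    }
```

Precision tracks how much invented work a model produces, and recall tracks how much real work it drops, so reporting both per model and quant keeps the two failure modes separate.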

Cheerio everyone