Never thought I'd say this, but Qwen3.7-Max is now running our 2,000-tester office agent. Better than Opus 4.8.

Posted by Least-Orange8487@reddit | LocalLLaMA | 12 comments

Note: I also made the attached marketing video using Qwen3.7-Max. My poor design skills aside, it's actually very impressive how good it is.

Anyway, I built PocketBot, an iOS personal Chief of Staff with a to-do-list interface. It has ~2,000 TestFlight users now, and we just moved the main LLM behind the harness to Qwen3.7-Max.

The product surface is simple:

- read work context across all the integrations

- find the actual follow-ups

- turn them into approve/reject actions

- draft the email/doc/message/deck if needed

- never send or change anything without user approval
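To make the last rule concrete, here is a minimal Python sketch of an approval gate. Every name in it (`ProposedAction`, `Status`, `execute`) is made up for illustration and is not PocketBot's actual code; the only point is that nothing runs unless the user has explicitly approved it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ProposedAction:
    """One follow-up the agent wants to perform (hypothetical shape)."""
    kind: str                      # e.g. "send_email", "update_doc"
    summary: str                   # what the user sees in the to-do list
    draft: str = ""                # optional drafted email/doc/message
    status: Status = Status.PENDING


def execute(action: ProposedAction, side_effect: Callable[[ProposedAction], None]) -> None:
    """Run the side effect only if the user has approved the action."""
    if action.status is not Status.APPROVED:
        raise PermissionError(f"action {action.kind!r} is {action.status.value}, not approved")
    side_effect(action)
```

Keeping the check inside `execute`, not in the UI layer, means a model that emits an unexpected tool call still cannot send anything on its own.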

Qwen3.7-Max is annoyingly good at this. Like, seriously, I've been spending $400/mo on both Codex and Claude subscriptions (admittedly, it is an obsession now), and this genuinely made me question those subscriptions.

Not because it “chats” better, but because messy office work is a long-context tool-use problem: stale emails, meeting notes, calendar constraints, Slack threads, Notion pages, Drive docs, half-finished drafts, and then the user asking

“what am I about to miss?”

But Qwen3.7-Max is cloud-only. So now I want to answer the question this sub actually cares about:

How much of this agent can be moved to local/open-weight models before it becomes unreliable?

My current split idea:

- Qwen3.7-Max: hard planning, long-context synthesis, final drafting

- local Qwen: task extraction, classification, privacy-sensitive preprocessing, memory updates

- maybe small local model: routing / “is this even actionable?”
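Roughly, the split above could be written as a static routing table. This is only a sketch: the task-kind strings and tier names are placeholders I made up, and a real router would probably also look at context length and data sensitivity.

```python
# Hypothetical task kinds mapped to the three tiers from the list above.
ROUTES = {
    "hard_planning": "cloud",
    "long_context_synthesis": "cloud",
    "final_drafting": "cloud",
    "task_extraction": "local",
    "classification": "local",
    "privacy_preprocessing": "local",
    "memory_update": "local",
    "is_actionable": "small_local",
}


def route(task_kind: str) -> str:
    """Return the tier for a task kind; refuse unknown kinds instead of guessing."""
    try:
        return ROUTES[task_kind]
    except KeyError:
        raise ValueError(f"no route configured for task kind {task_kind!r}") from None
```

Raising on unknown kinds is a deliberate choice here: silently defaulting to the cloud tier would defeat the privacy-sensitive part of the split.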

Models I'm planning to test:

- Qwen3.7-Max as the cloud baseline

- Qwen3.6-27B local

- Qwen3.6-35B-A3B local

- maybe Gemma 4 / GLM / DeepSeek for comparison

For people here running local agents seriously: What would you test first?

I care much less about “can it produce a pretty demo?” and more about whether a local model can be trusted to extract actions from private work context without inventing work that does not exist.

If there is interest, I’ll post the results with:

- model / quant

- runtime

- hardware

- context size

- tool-call errors

- hallucinated actions

- missed actions

- latency

- cost

- real failure examples

- etc, etc, etc
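For the hallucinated/missed action counts, a minimal scoring sketch might look like the following. It assumes each action has already been normalized to a comparable string ID and that a hand-labeled gold set exists per test case; both are assumptions, and real matching would likely need something fuzzier than exact string equality.

```python
def score_actions(predicted: set, gold: set) -> dict:
    """Compare extracted actions against a hand-labeled gold set."""
    matched = predicted & gold
    hallucinated = predicted - gold   # model invented work that does not exist
    missed = gold - predicted         # real follow-ups the model did not surface
    return {
        "hallucinated": sorted(hallucinated),
        "missed": sorted(missed),
        # Convention: empty sets score 1.0 so an empty case is not penalized.
        "precision": len(matched) / len(predicted) if predicted else 1.0,
        "recall": len(matched) / len(gold) if gold else 1.0,
    }


result = score_actions(
    predicted={"reply_to_ana", "book_room", "send_invoice"},
    gold={"reply_to_ana", "send_invoice", "review_deck"},
)
# result["hallucinated"] == ["book_room"], result["missed"] == ["review_deck"]
```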

Cheerio everyone

JockY@reddit

Wait, how are you running this locally?

Or is this another commercial for a cloud service in the local llama subreddit?

Least-Orange8487@reddit (OP)

Qwen3.7-Max is not local. We’re using it as the cloud baseline. The reason I posted here is because I’m trying to measure what can be replaced by local/open-weight Qwen models in the same agent loop.

Sorry if the post was confusing. I'm just trying to get some advice: a lot of people would prefer to have full control over their privacy, so we're trying to find the best way to do it...

Lemondifficult22@reddit

Excellent use case and description. Keep the prompts, reasoning, tool calls, and tool responses. Then you can train a LoRA or fine-tune to hopefully get the same "bias to certain answers" as the Max model. Hope that helps.
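A minimal sketch of what keeping those traces could look like, one JSON object per line. The field names and the `log_trace` helper are made up for illustration and do not follow any particular fine-tuning framework's schema, so you would still need to convert them to whatever your trainer expects.

```python
import io
import json


def log_trace(fp, prompt, reasoning, tool_calls, tool_responses):
    """Append one agent turn to a JSONL stream (hypothetical record layout)."""
    record = {
        "prompt": prompt,
        "reasoning": reasoning,
        "tool_calls": tool_calls,
        "tool_responses": tool_responses,
    }
    fp.write(json.dumps(record, ensure_ascii=False) + "\n")


# Usage with an in-memory buffer; in practice fp would be a file opened in append mode.
buf = io.StringIO()
log_trace(
    buf,
    prompt="What am I about to miss?",
    reasoning="Check the calendar before reading email.",
    tool_calls=[{"name": "list_events", "arguments": {"days": 2}}],
    tool_responses=[{"name": "list_events", "content": "[]"}],
)
```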

Least-Orange8487@reddit (OP)

Thanks, that's very interesting!

AutoModerator@reddit

Hello! Your post was removed as you do not have sufficient karma on r/LocalLLaMa. We are doing this in response to the large volume of spam we are unfortunately experiencing. Please participate in the sub (through comments), gain the minimum of 5 karma, and then re-post.

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

Foreign_Risk_2031@reddit

You copied this text from a terminal window

Least-Orange8487@reddit (OP)

Nah. Wait, yes. Technically. I write it myself but get Claude to improve my writing 'cause it's horrendous. I even tried making small "to-do lists" in the post body, as I thought it would be funny... never mind.

Foreign_Risk_2031@reddit

You have a line break where it doesn’t make sense lol. It copied a \n where there was actually just word wrap.

Least-Orange8487@reddit (OP)

Ah, I see, thanks. Does look a bit atrocious, admittedly, haha.

Least-Orange8487@reddit (OP)

Also yeah obviously if anyone wants to give it a spin or has recommendations/ideas, trust me I'd be more than happy and eternally grateful: https://testflight.apple.com/join/EdDHgYJT

NoFaithlessness951@reddit

No local, don't care.

Least-Orange8487@reddit (OP)

Yeah we want local too 😞