Best setup for agentic coding (largely unsupervised) 8gb VRAM and 32 GB Sys RAM, Olamma Cloud and a frontier sub?
Posted by Song-Historical@reddit | LocalLLaMA | 3 comments
Hi!
I'm looking for a coding-agent workflow where I run a local model for implementation plus something cloud-based, à la Olamma Cloud or a frontier subscription (ChatGPT, Claude, whatever), so I can code continuously without hitting usage limits. I've had some success with Qwopus 9B, but my machine can only manage 30k tokens in LM Studio.
I was going to try pi.dev or oh my pi and replicate some of the features from Lucas Meijer's recent talk. I particularly like the dashboard for code review.
While I have some programming experience and a general interest in computer science and math (enough to own old copies of The Art of Computer Programming) and can read pseudocode, I'm by no means a full-stack programmer. I've only done enough system administration and programming to work on hardware projects as a hobby, or to understand a library well enough to hire someone and not get scammed. I have far more UX experience than anything else.
I need a solid workflow for large projects so I can get back to work. My business partners have dipped out of this space entirely, leaving me effectively stranded as a solo operator, and I'm struggling a little to get my bearings. I would use Claude or Codex but keep hitting usage limits.
I need a workflow where I can manage context well, with a continuous handoff of documentation, memory, and context between a few agents that doesn't break with every update. I'm willing to pay for Claude or ChatGPT to handle planning, QA, and research, so it can look up documentation and I don't have to maintain a complex RAG setup for current best practices.
Anyway, there are a thousand videos out there; I'm hoping to narrow it down to a strong workflow for under 100 USD a month, preferably half that, since I have server costs, marketing costs, and cash flow to worry about. Has anyone had success with a similar setup?
Feeling_Ad_2729@reddit
8GB VRAM is the hard ceiling for "unsupervised agentic." At that size local models hallucinate tools and mis-order steps enough that you'll spend more time fixing than the frontier sub would've cost.
Where I've seen the 2-tier split actually work:
Use the frontier as the orchestrator, and have it delegate well-scoped tool calls to the local model only for batch tasks (e.g. "apply this lint fix across 200 files"). Don't try to run the planning on local — that's where it breaks.
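The shape of that split can be a one-function router sitting in front of the two endpoints. This is a toy sketch, not a real classifier — the keyword list and tier names are illustrative assumptions:

```python
# Toy router for the 2-tier split: anything that smells like planning
# or ambiguity goes to the frontier model; only fully-specified batch
# work goes local. The signal list is a stand-in for whatever real
# classification your framework does.
PLANNING_SIGNALS = ("design", "plan", "refactor", "decide", "investigate")

def route_task(task: str) -> str:
    text = task.lower()
    if any(word in text for word in PLANNING_SIGNALS):
        return "frontier"
    return "local"

print(route_task("Plan the migration to the new auth API"))  # frontier
print(route_task("Apply this lint fix across 200 files"))    # local
```

The point is that the local tier never sees a task it has to interpret — by the time work reaches it, every decision has already been made upstream.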
On context: if 30k is your ceiling on Qwopus 9B, switch to a 7B at Q4_K_M and you'll get 40-48k with the same RAM. Useful for agentic loops that need tool history.
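A back-of-envelope sketch of why that works — every number here is a rough assumption (Llama-style 7B geometry with 32 layers, 8 KV heads, head_dim 128, ~4.5 bits/weight for Q4_K_M, and an 8-bit KV cache), not a measurement of any specific model:

```python
# Rough VRAM estimate for a quantized 7B plus its KV cache.
# All geometry numbers are assumptions for a generic Llama-style 7B.
def weights_gb(params_billion, bits_per_weight=4.5):  # ~Q4_K_M
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(tokens, layers=32, kv_heads=8, head_dim=128, kv_bytes=1):
    # K and V each store layers * kv_heads * head_dim values per token;
    # kv_bytes=1 assumes an 8-bit quantized KV cache.
    return 2 * layers * kv_heads * head_dim * kv_bytes * tokens / 1e9

total = weights_gb(7) + kv_cache_gb(45_000)
print(f"~{total:.1f} GB for 7B Q4_K_M + 45k context")  # ~6.9 GB
```

Swap in your actual model's layer and head counts before trusting it — and note that with an fp16 KV cache instead of 8-bit, the cache roughly doubles and blows the 8 GB budget.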
Song-Historical@reddit (OP)
Yes, I understand. I need a handoff workflow/agent system where the larger-context, better model does the file reading and planning, breaks a task down into small actionable steps, then assigns only the decomposed task to the smaller model, which follows a set workflow and only edits certain lines. The local agent would edit the file, read it back, produce a diff or something, and write a report on what it did. Then it would go back to the larger model for QA.
Is that viable? Maybe a Gemma 4-based model makes more sense the smaller it gets. I wanted to see if anyone actually has this workflow running.
I'd appreciate hearing from someone who has actually tried it.
Feeling_Ad_2729@reddit
Viable. Claude Code subagents do this in-framework but it's frontier→frontier. For a real big→small split, Aider's architect/editor mode is the reference: https://aider.chat/docs/usage/modes.html — architect plans, editor applies.
Two failure modes to design around:
Small models lie about what they did. Require diff verification, not narrative reports. The editor outputs a patch; the architect diffs it against the plan.
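One way to do that check, assuming the editor hands back the whole edited file rather than a narrative: build the diff yourself with Python's `difflib` and compare the touched lines against what the plan allowed. `verify_edit` and the allowed-line-set convention are made up for illustration:

```python
import difflib

def verify_edit(before: str, after: str, allowed: set) -> bool:
    """True only if every changed line number in `before` is in `allowed`."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(),
                                lineterm="", n=0)
    changed = set()
    for line in diff:
        if line.startswith("@@"):
            old = line.split()[1]  # hunk header old range, e.g. "-12,2" or "-12"
            start = int(old[1:].split(",")[0])
            count = int(old.split(",")[1]) if "," in old else 1
            changed.update(range(start, start + max(count, 1)))
    return changed <= allowed

before = "a\nb\nc\nd\n"
print(verify_edit(before, "a\nB\nc\nd\n", {2}))  # only line 2 changed -> True
print(verify_edit(before, "A\nb\nc\nd\n", {2}))  # line 1 changed -> False
```

If verification fails, the architect re-issues the step instead of accepting the editor's self-report — the editor's own account of what it did never enters the loop.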
Edge cases (merge conflicts, type errors, ambiguous refactors) need explicit fail-fast rules or the small model papers over them and moves on.
Cheap benchmark: take a bug-fix from a real repo you already know, run the split on it, and check whether the editor's diff matches what a frontier-only run would have produced. That tells you whether the handoff preserves enough context, without building a new test harness.
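A quick way to score that comparison, assuming you capture both runs as unified-diff text — the function name is made up, and any pass threshold you pick would be an arbitrary assumption:

```python
import difflib

def patch_similarity(patch_a: str, patch_b: str) -> float:
    # 1.0 means byte-identical patches; values near 1.0 suggest the
    # split workflow preserved enough context to land the same fix.
    return difflib.SequenceMatcher(None, patch_a, patch_b).ratio()

split_run    = "-return x\n+return x + 1\n"
frontier_run = "-return x\n+return x + 1\n"
print(patch_similarity(split_run, frontier_run))  # 1.0 for identical patches
```

A low score doesn't pinpoint which handoff lost the context, but it cheaply flags that one did.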