Local AI setup 1x5090, 5x3090

Posted by Emergency_Fuel_2988@reddit | LocalLLaMA

What I’ve been building lately: a local multi-model AI stack that’s getting kind of wild (in a good way)

Been heads-down working on a local AI stack that’s all about fast iteration and strong reasoning, running entirely on consumer GPUs. It’s still evolving, but here’s what the current setup looks like:

🧑‍💻 Coding Assistant

Model: Devstral Q6 on LMStudio
Specs: Q4 KV cache, 128K context, running on a 5090
Getting ~72 tokens/sec and still have 4GB VRAM free. Might try upping the quant if quality holds, or keep it as-is to push for a 40K token context experiment later.
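For anyone who wants to script against this instead of chatting in the UI: LM Studio exposes an OpenAI-compatible server (default port 1234), so the standard `openai` client works. A minimal sketch — the model identifier is a placeholder, check the exact name your LM Studio instance reports:

```python
# Minimal client for LM Studio's OpenAI-compatible server.
# Assumes the default port (1234) and that Devstral is loaded;
# "devstral-small" is a placeholder identifier -- check the
# LM Studio UI or API model list for the real name on your box.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="devstral-small",  # placeholder identifier
    messages=[
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Write a Python function that reverses a linked list."},
    ],
    temperature=0.2,
    max_tokens=1024,
)
print(resp.choices[0].message.content)
```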

🧠 Reasoning Engine

Model: Magistral Q4 on LMStudio
Specs: Q8 KV cache, 128K context, running on a single 3090
Tuned more for heavy-duty reasoning tasks. Performs effectively up to 40K context.
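Side note on why the KV-cache quant matters as much as the weight quant at these context lengths — here’s a back-of-envelope estimate. The architecture numbers (40 layers, 8 KV heads, head_dim 128) are my assumptions for the ~24B Mistral Small family these models build on, so treat the output as ballpark only:

```python
# Rough KV-cache VRAM estimate. Architecture numbers are assumed
# (40 layers, GQA with 8 KV heads, head_dim 128) -- verify against
# the actual model config before trusting the totals.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # 2x for keys and values, one entry per layer/head/position
    total = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
    return total / 1024**3

for name, bpe in [("FP16", 2.0), ("Q8", 1.0), ("Q4", 0.5)]:
    print(f"{name}: {kv_cache_gib(40, 8, 128, 128 * 1024, bpe):.1f} GiB at 128K context")
# FP16: 20.0 GiB, Q8: 10.0 GiB, Q4: 5.0 GiB (rough) -- the KV quant,
# not just the weight quant, decides whether 128K context fits
# alongside the model weights on a single card.
```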

🧪 Eval + Experimentation

Using local Arize Phoenix for evals, tracing, and tweaking. Super useful to visualize what’s actually happening under the hood.
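If anyone wants to reproduce the Phoenix part, this is roughly the wiring — assuming a recent `arize-phoenix` release plus the OpenInference OpenAI instrumentor, since the package layout has shifted between versions:

```python
# Launch a local Phoenix instance and auto-trace OpenAI-compatible
# calls (which covers LM Studio's server, since it speaks the same
# protocol). Package names assume recent arize-phoenix and
# openinference-instrumentation-openai releases.
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

px.launch_app()  # UI at http://localhost:6006; keep the process alive
tracer_provider = register(project_name="local-stack")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# From here on, every chat.completions.create(...) made with the
# openai client shows up as a trace in the Phoenix UI.
```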

📁 Codebase Indexing

Using: Roo Code

🔜 What’s next

This stack is definitely becoming a bit of a GPU jungle, but the speed and flexibility it gives are worth it.

If you're working on similar local inference workflows, or know a good way to do smart GPU assignment in multi-model setups, I’m super interested in this one challenge:

When a smaller model fails (say, after 3 tries), auto-escalate to a larger model with the same context, and save the larger model’s response as a reference for the smaller one in the future. Would be awesome to see something like that integrated into Roo Code.
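To make the idea concrete, here’s a sketch of the behavior I mean. Nothing here is an existing Roo Code feature: the model names, the two server endpoints, the `validate()` check, and the file-backed reference store are all placeholders.

```python
# Sketch of the escalation idea: try the small model a few times,
# fall back to the large one with the same context, and keep the
# large model's answer as a reference for future attempts on the
# same prompt. Everything below is illustrative, not a real API.
import json
from pathlib import Path
from openai import OpenAI

small = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # small model
large = OpenAI(base_url="http://localhost:1235/v1", api_key="lm-studio")  # large model

REFS = Path("escalation_refs.json")
refs = json.loads(REFS.read_text()) if REFS.exists() else {}

def ask(client, model, messages):
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

def validate(answer: str) -> bool:
    # Placeholder: swap in a real check (tests pass, output parses,
    # eval score above threshold, etc.)
    return bool(answer.strip())

def solve(prompt: str, max_small_tries: int = 3) -> str:
    messages = [{"role": "user", "content": prompt}]
    # If a larger model already solved this prompt, feed its answer
    # back to the small model as a reference.
    if prompt in refs:
        messages.insert(0, {
            "role": "system",
            "content": f"Reference solution from a stronger model:\n{refs[prompt]}",
        })
    for _ in range(max_small_tries):
        answer = ask(small, "devstral-small", messages)  # placeholder name
        if validate(answer):
            return answer
    # Escalate with the same context, then cache the result.
    answer = ask(large, "magistral-small", messages)  # placeholder name
    refs[prompt] = answer
    REFS.write_text(json.dumps(refs))
    return answer
```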