Local AI setup 1x5090, 5x3090
Posted by Emergency_Fuel_2988@reddit | LocalLLaMA | 34 comments
What I’ve been building lately: a local multi-model AI stack that’s getting kind of wild (in a good way)
Been heads-down working on a local AI stack that’s all about fast iteration and strong reasoning, fully running on consumer GPUs. It’s still evolving, but here’s what the current setup looks like:
🧑‍💻 Coding Assistant
Model: Devstral Q6 on LMStudio
Specs: Q4 KV cache, 128K context, running on a 5090
Getting ~72 tokens/sec and still have 4GB VRAM free. Might try upping the quant if quality holds, or keep it as-is to push for a 40K token context experiment later.
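For anyone wiring this into their own tools, this is roughly how a client talks to it - a minimal sketch against LM Studio's OpenAI-compatible server, assuming the default port 1234 and a placeholder model id (check /v1/models for the real one):

```python
# Minimal sketch: querying the Devstral instance through LM Studio's
# OpenAI-compatible server. Port 1234 is LM Studio's default; the model id
# below is a placeholder - list http://localhost:1234/v1/models for the real one.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
reply = client.chat.completions.create(
    model="devstral-small",  # placeholder id
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    temperature=0.2,
)
print(reply.choices[0].message.content)
```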
🧠 Reasoning Engine
Model: Magistral Q4 on LMStudio
Specs: Q8 KV cache, 128K context, running on a single 3090
Tuned more for heavy-duty reasoning tasks. Performs effectively up to 40K context.
🧪 Eval + Experimentation
Using local Arize Phoenix for evals, tracing, and tweaking. Super useful to visualize what’s actually happening under the hood.
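If anyone wants to replicate the tracing side, the wiring is roughly this - a sketch assuming the arize-phoenix and openinference-instrumentation-openai packages (exact APIs may differ by version):

```python
# Rough sketch: launch a local Phoenix instance and trace OpenAI-compatible
# calls into it. Assumes arize-phoenix and openinference-instrumentation-openai
# are installed; APIs may vary between versions.
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

px.launch_app()                    # local Phoenix UI, default http://localhost:6006
tracer_provider = register()       # route OTLP spans to the local Phoenix instance
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
# From here, any call made through the openai client (LM Studio, vLLM, etc.)
# shows up as a trace in the Phoenix UI.
```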
📁 Codebase Indexing
Using: Roo Code
- Qwen3 8B embedding model, FP16, 40K context, 4096D embeddings
- Running on a dedicated 3090
- Talking to Qdrant (GPU mode), though having a minor issue where embedding vectors aren't passing through cleanly - might just need to dig into what's getting sent/received (quick sanity check sketched below)
- Would love a way to dedicate part of a GPU just to embedding workloads. Anyone done that?
- ✅ Indexing status: green
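The sanity check mentioned above is roughly this - a sketch assuming Qdrant's default port 6333 and a placeholder collection name (Roo Code creates its own):

```python
# Sanity-check sketch for the embedding issue: confirm the collection's vector
# size matches the 4096-D Qwen3 embeddings and peek at one stored point.
# "roo_code_index" is a hypothetical name - use whatever collection Roo Code created.
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
info = client.get_collection("roo_code_index")
print(info.config.params.vectors)  # expect size=4096

points, _ = client.scroll("roo_code_index", limit=1, with_vectors=True)
if points:
    vec = points[0].vector
    # vector may be a plain list or a dict of named vectors depending on the schema
    print(len(vec) if isinstance(vec, list) else {k: len(v) for k, v in vec.items()})
else:
    print("no points stored yet")
```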
🔜 What’s next
- Testing Kimi-Dev 72B (EXL3 quant @ 5bpw, layer split) across 3x3090s—two for layers, one for the context window—via TextGenWebUI or vLLM on WSL2
- Also experimenting with an 8B reranker model on a single 3090 to improve retrieval quality, still playing around with where it best fits in the workflow
This stack is definitely becoming a bit of a GPU jungle, but the speed and flexibility it gives are worth it.
If you're working on similar local inference workflows—or know a good way to do smart GPU assignment in multi-model setups—I’m super interested in this one challenge:
When a smaller model fails (say, after 3 tries), auto-escalate to a larger model with the same context, and save the larger model’s response as a reference for the smaller one in the future. Would be awesome to see something like that integrated into Roo Code.
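To make the idea concrete, here's a rough Python sketch of that loop over two OpenAI-compatible local endpoints - the ports, model ids, and check() hook are placeholders, not anything Roo Code ships today:

```python
# Rough sketch of the escalate-on-failure idea - not an existing Roo Code feature.
# Assumes two OpenAI-compatible local servers (small model on :1234, large model
# on :1235) and a caller-supplied check() that decides "good enough".
import json
from openai import OpenAI

small = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
large = OpenAI(base_url="http://localhost:1235/v1", api_key="lm-studio")

def ask(messages, check, small_model="devstral-small", large_model="kimi-dev-72b", retries=3):
    for _ in range(retries):
        answer = small.chat.completions.create(
            model=small_model, messages=messages
        ).choices[0].message.content
        if check(answer):
            return answer
    # Small model failed repeatedly: escalate with the same context.
    answer = large.chat.completions.create(
        model=large_model, messages=messages
    ).choices[0].message.content
    # Keep the large model's answer as a future reference for the small one
    # (few-shot hint, vector store entry, whatever fits the workflow).
    with open("escalation_log.jsonl", "a") as f:
        f.write(json.dumps({"messages": messages, "reference": answer}) + "\n")
    return answer
```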
Silver_Treat2345@reddit
1 x Gigabyte G292-Z20 with 8 x RTX A5000 (192GB VRAM in total)
1 x old X99 Mainboard with 4 x RTX 3060 (48GB VRAM in total)
1 x HP Z20 with 1 x P5000 (16GB VRAM) and 2 x P4000 (8GB VRAM each)
1 x Dell Notebook with Quadro RTX 5000 (16GB VRAM)
I'm still toying with the idea of two more stacks with 2x or 4x cards each - RTX 4060 Ti (16GB VRAM) and RTX 5060 (16GB VRAM) - in the end this decision depends a bit on the usefulness and hardware pricing.
Boy, I have learned so much about PCIe throughput, available lanes, tensor parallelism, pipeline parallelism, model quantisation (weights and activations), finetuning, calibration sets, KV cache size requirements, vLLM inference, TensorRT-LLM inference, and the different architecture generations and their capabilities.
Emergency_Fuel_2988@reddit (OP)
You are not alone - I have room for 14 full-sized GPUs and am always on the lookout for a 3090 :)
This might cheer you up.
Silver_Treat2345@reddit
What kind of mainboard and housing is that? Until now all my setups fit into 19" racks and standard Fractal or workstation cases, but for the additional stacks planned I'm thinking about a mining rack with risers. Only the available PCIe Gen4 (for the Ada GPUs) and Gen5 (for the Blackwell ones) lanes concern me. For 4 GPUs, enough lanes are mostly only available on enterprise Threadripper platforms and the corresponding mainboards, which come at really high additional price tags. The G292-Z20 is pretty affordable on the refurb market for its capabilities, but it only fits enterprise and workstation cards, not consumer cards (with power plugs on their sides). It's also great for tensor parallelism in pairs of two (in my case I even connected these pairs via NVLink bridges), as each pair sits on its own PCIe switch that is connected to the CPU with 16 lanes total. The CPU itself supports 128 lanes.
Emergency_Fuel_2988@reddit (OP)
It's a dual-Xeon workstation inside a mining rig.
https://www.amazon.in/dp/B0CG4LM37C
gpupoor@reddit
using these with LM studio is such a joke, you're wasting 5 of your 6 cards.
you've really given a lot of thought to this before dropping $5k in GPUs I see
Emergency_Fuel_2988@reddit (OP)
I am using Windows, hence LM Studio. Recently tried setting up vLLM using WSL2; I had run EXL models in the past with TP, using text-gen-webui.
gpupoor@reddit
iirc vllm should work just fine on wsl. in theory all you need to do is create a python virtual environment (to avoid destroying the system python ubuntu depends on) with python -m venv vllmenv and then pip install vllm.
then you can just do "vllm serve modelpath --max-model-len 32768 -tp 4"
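once it's serving, a quick sanity check from python (assuming the default port 8000 and the openai package installed):

```python
# quick smoke test against vLLM's OpenAI-compatible server (default port 8000)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
print([m.id for m in client.models.list().data])  # should list the model passed to `vllm serve`
```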
kryptkpr@reddit
I get that it's easy but running GGUF on a rig like this is throwing so, so much performance out the window 😞
Emergency_Fuel_2988@reddit (OP)
I know - I prefer EXL quants, and that's what Kimi-Dev is going to use. I'm not so sure about the perf hit when the model is split, though. I might get 2 NVLinks for the 4 3090s I have.
celsowm@reddit
Would you mind trying w8a8_fp8 models on your 5090?
Emergency_Fuel_2988@reddit (OP)
I remember downloading one, but never got it working - lack of time. Which runtime or software do you generally run this quant on?
celsowm@reddit
SGLang, and the w8a8 version of anything with 32 billion params or less.
No_Afternoon_4260@reddit
How fast is DeepSeek R1?? Lol
sleepy_roger@reddit
What model did you use to write this post? It's poor AI slop.
FullOf_Bad_Ideas@reddit
I wonder how this flavour of text happened to poison LLMs. Models do this because somewhere, sometime in the past, people selected those kinds of responses as preferred during RLHF and it spread around.
Who are those people????
Emergency_Fuel_2988@reddit (OP)
Glad to see my post passed the Turing Test - it made you emotional.
offlinesir@reddit
In fact, it failed (not passed) the Turing Test. The test isn't based on if it creates emotion, but if a human can tell the difference between a human and a machine.
sleepy_roger@reddit
Oh, I actually didn't read it. I was interested based on your title, but if you can't take the time to write a post, why would I take the time to read it?
Emergency_Fuel_2988@reddit (OP)
Totally get that. Just so you know, I actually do take my time - I've spent half an hour doing calligraphy on a single word. I use AI sometimes to help shape thoughts, but I'm still very intentional with what I write. A lot of the time, I'm not even writing for others, just for future me.
Cheers, let's get back to the shiny tech discussion.
I took 30 minutes to create the embeddings heatmap below. The AI model helped me with the sentences - something my old self would have never imagined creating.
Marksta@reddit
I honestly think the sub should ban posts like this - so tired of rocket-emoji AI slop. It's the clearest sign that there aren't words in a post, just output tokens. And if you do waste the time to read it, it's just nonsense.
kkgmgfn@reddit
Not many models support Roo, though?
Emergency_Fuel_2988@reddit (OP)
Ollama configurations come with a system prompt - copy that over to LM Studio. I've run many models that way.
kkgmgfn@reddit
even ollama can do ollama serve.
What configuration?
Emergency_Fuel_2988@reddit (OP)
Try this - https://ollama.com/fomenks/devstral-small_cline_roocode-64k:latest
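If it helps, the system prompt can be pulled straight out of the Ollama Modelfile and pasted into LM Studio's system prompt field - a rough sketch, assuming the ollama CLI is installed and the model has been pulled:

```python
# Rough helper sketch: extract the SYSTEM prompt from an Ollama Modelfile so it
# can be pasted into LM Studio. Assumes the ollama CLI is on PATH and the model
# has already been pulled; handles the SYSTEM """...""" form of the Modelfile.
import re
import subprocess

def ollama_system_prompt(model: str) -> str | None:
    modelfile = subprocess.run(
        ["ollama", "show", model, "--modelfile"],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r'SYSTEM\s+"""(.*?)"""', modelfile, re.DOTALL)
    return match.group(1).strip() if match else None

print(ollama_system_prompt("fomenks/devstral-small_cline_roocode-64k"))
```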
SuddenOutlandishness@reddit
Love your setup! I'm building a similar but multi-machine setup. I've got a 32GB M2 Pro MacBook Pro, a 24GB M2 MacBook Air, and a 128GB M4 Max MacBook Pro arriving soon. I've also got three RK3588 devices (one with 32GB RAM and 2 with 16GB) for running small models on - I'm thinking event / email processors.
Emergency_Fuel_2988@reddit (OP)
Apple is useless - I'm an M1 Max 32GB user myself. The memory bandwidth barely reaches 400 GB/s; only the Ultra versions have something close to a 3090. I invested a lot of time in the Metal ecosystem, and it simply isn't cut out to run models. My M1 Max, even with 28GB of GPU memory running headless, is about 1/3 of a single 3090 in raw power: ~10 TFLOPS vs ~34 TFLOPS.
SuddenOutlandishness@reddit
Rude.
Emergency_Fuel_2988@reddit (OP)
It's not you, it's Apple - it broke my heart badly. I appreciate what they do, but the hardware simply isn't meant for inference. Prompts nowadays go long, I mean really long. Power consumption is ultra low, but you wait super long [so no real power savings], and at the end of the day it's your time you end up spending.
sub_RedditTor@reddit
How about 1x 4090 48GB?
Or even better, 2x AMD W7900 48GB.
Have you heard of the AMD MI50 Instinct?
Emergency_Fuel_2988@reddit (OP)
A 4090 48GB comes without warranty and with the risk of future driver-level patching (if Nvidia decides). I'd rather get an RTX Pro 6000 if the budget allows, but logically I'd be paying three times the price of a 5090 when I could get 3x the raw power by buying three 5090s. I am very optimistic about future breakthroughs in tensor parallelism, which makes this daylight robbery all the more short-sighted and temporary.
Business-Weekend-537@reddit
Have you written a blog post or Reddit post on how you got set up / your workflow?
If not would you be willing to post some links to tutorials you used?
I've set up LM Studio and have a single 3090 hooked up, and plan on hooking up more 3090s. I haven't ironed out a workflow for coding yet - I've just used LM Studio to try out some models for code snippets.
Emergency_Fuel_2988@reddit (OP)
LM Studio had a nice surprise - it is by far the best UI for managing multiple GPUs.
You_Wen_AzzHu@reddit
How's your coding assistant doing?
Emergency_Fuel_2988@reddit (OP)
I had it build a Chrome extension, and the coder model got stuck; that's when a reasoning model made more sense, and the coder model working alongside a reasoning model was able to pull through.