we really all are going to make it, aren't we? 2x3090 setup.
Posted by RedShiftedTime@reddit | LocalLLaMA | View on Reddit | 66 comments
i'm blown away. i saw someone made a post the other day about "club-3090" and after having sonnet patch some fixes into it, specifically an sse-session drop bug and a tool-calling bug, it's fair to say that even "budget" setups like mine will have a path forward soon for local-only ai.
reference github: https://github.com/noonghunna/club-3090 (not mine)
after getting this running, i was originally using WSL2. fair to say, it was "better" than LM studio but not quite good. t/s was like 30 and pp was around 400....i said fuck it and installed ubuntu as a dual boot on the same machine (i'm just not very linux friendly when it's headless, prefer windows RDP) and wow. i'm getting like 4000 pp/s and 113 tk/s with no nvlink. supposedly, nvlink would make it even faster.....
either way, i'm very excited about this new local future. qwen 3.6 27b with 262k context on 48 GB VRAM feels almost-sonnet level, and it's MUCH faster than cloud. and useful! i had it write some monkey patches and they work fantastic, as well as some relatively useful code reviews. i'm now working on getting it to handle the ssh sessions on my linux computers.
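for anyone curious, the scripting side is just the standard openai client pointed at the local endpoint. rough sketch below, where the port, model id and diff file are placeholders for whatever your own server exposes, not anything club-3090 specific:

```python
# rough sketch: standard openai client pointed at a local OpenAI-compatible server.
# base_url/port, model id and the diff file are all placeholders, adjust for your setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed local llama.cpp/vLLM-style endpoint
    api_key="not-needed",                 # local servers usually ignore the key
)

with open("patch.diff") as f:
    diff = f.read()

resp = client.chat.completions.create(
    model="qwen3.6-27b",  # placeholder model id, use whatever your server reports
    messages=[
        {"role": "system", "content": "You are a terse code reviewer."},
        {"role": "user", "content": f"Review this diff and flag any bugs:\n\n{diff}"},
    ],
    max_tokens=1024,
)
print(resp.choices[0].message.content)
```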
wondering what the next upgrade path could be. i was thinking about m5 ultra 512 GB + 4x DGX Sparks (prompt processing speeeeed) but now I'm wondering if we'll reach frontier class intelligence (maybe only domain specific) in smaller models in the next 12 months?
awesome!
yad_aj@reddit
the craziest part is that “local AI” went from: “look guys my 7B model can almost summarize text” to people casually running near-sonnet level coding workflows at usable speeds on consumer hardware
we massively underestimated software/runtime optimization, smaller model intelligence, how fast infra would improve etc etc
honestly wouldn’t even be surprised if domain-specific frontier models fit on prosumer setups within 1–2 years. “everyone gets a research lab” is starting to sound less insane every month lol
Etroarl55@reddit
Wish the hardware side could catch up; it's only the software side advancing. For most people, a setup with two or maybe a few more 3090s is the realistic ceiling.
techdevjp@reddit
That will come with time. 3090s are relatively affordable because they're two generations old. It won't be all that long before RTX PRO 5000 and 6000 boards are one generation old. A year or two later they will be two generations old. It will take time, but older cards that can still run local models well WILL fall in price.
Over the next few years as memory supply catches up with demand, current gen cards will fall in price which will put further pressure on older cards.
It's just a waiting game, really.
2Norn@reddit
you are assuming we get more efficient cards with higher memory and bandwidth released soon...
nvidia has skipped a memory jump for like 2-3 generations now. if the market were moving normally, the 5080 would have been 24gb and lower cards would have had 16/24gb variants
we had the 1080 ti with 11gb 9 fuckin years ago and now the 2nd best card is still 16gb
btw there is almost no chance we get new cards in 2026. and 2027 doesn't look good either. so we are stuck with what we have for about 18 more months. and with everyone slowly catching up with ai, 3090s will be non-existent very soon as well.
OttoRenner@reddit
One has to add that the vast majority of us will not be able to get our hands on new cards for years even if they came out tomorrow. The ones with already very good hardware (=with deep pockets) will buy up as much as they can and will most likely NOT sell their old rig, because more VRAM is still better and why demolish a running setup when you can just add more to it (headless server and what not).
Will the prices dip when new cards/technology drop? Perhaps a little, for a short moment. But the news will hype up the new cards and local AI so much that it will activate demand for local AI from companies/governments and private citizens...and since most of them will still not be able to buy the new tech (too expensive and out of stock on day 1) they will look for the next best tech...and the second best...and so on, pushing prices up again.
PLUS The more efficiently local AI runs on consumer hardware, the more people from outside the nerd/tech-bro bubble will start using it and will want to upgrade as well...and here we are entering a positive feedback loop (or we are in it already for some time), where consumer hardware will get even more attention from buyers and developers alike, leading to increased demand, rising prices and (ever) faster/better models...leading to...
Yes, there will be only so much you can squeeze out of the old cards. But I'm certain we are only starting to understand how to improve AI in general and especially on older hardware.
I've been saying this for months: we are in a post-apocalyptic FALLOUT scenario where people will put together frankenstein builds with tech from 5 generations. Every PCIe lane will be used, no matter the speed.
I pulled a 3090ti and a 3090 a month ago for 900€ each. They are now at around 1100€! That's insane
2Norn@reddit
damn, 3090s are still like 550€ here
maybe i should grab as many as possible and sell them on ebay globally, sounds like ez money
Etroarl55@reddit
Affordable 3090s are around 1100 usd where I live in Canada.
We are in a time where anything with 24gb of vram or more is being bought up and used. I think the worst-case scenario is manufacturers see this and further abstain from giving us too many 24gb+ vram GPUs. It will directly hurt their business if everyone can suddenly run a Claude Opus at home for free.
UnethicalExperiments@reddit
3060s on a Gen 4 x4 setup are getting me reasonable results. I'm planning on doing a post later on to really get things going. Prior to this I just used ollama with openwebui.
Now I'm using llama.cpp and getting pretty decent results with qwen 3.6 35b q8 with 200k context. About 70t/s with 4 3060s
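For anyone who wants the gist of the multi-gpu split, through the python bindings it's roughly the below (just a sketch, assuming llama-cpp-python built with CUDA; the model path, context and split values are placeholders, I actually run llama-server with the equivalent flags):

```python
# rough equivalent of my llama-server setup via llama-cpp-python (sketch only).
# assumes a CUDA build; model path / ctx / split values are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/qwen-3.6-35b-q8_0.gguf",  # placeholder path
    n_gpu_layers=-1,            # offload every layer to the GPUs
    tensor_split=[1, 1, 1, 1],  # spread the weights evenly across the 4x 3060s
    n_ctx=200_000,              # the ~200k context mentioned above
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```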
techdevjp@reddit
I'm very interested in this. I have two 3060 12GB cards in boxes here (and one 3080 10gig in my current PC). Very interested in hearing how similar hardware has performed for you.
mycycle_@reddit
I think they know, and that's why the hardware side isn't catching up. It was OpenAI who started the RAM shortage after all, not the manufacturers (at least initially; they are complicit). This is partially why competing companies are not suing over this 2000s-era Micron price hiking: if we had the RAM, in a few years they'd be cooked. What would support this hypothesis is if the manufacturers do not significantly increase supply to meet demand, instead preferring the current price margins.
LaysWellWithOthers@reddit
I am sooooo glad that I managed to cobble together a 4x3090 rig before RAMageddon.
The thing that drove me to do so was based on my feeling that it was likely that local models would catch up, but I did not expect them to do so as quickly as they did.
I started playing with local AI ten years ago and it is absolutely mind boggling what is now possible.
I do many things with my rig, but presently I have it set up as a self-improving (profitable) agentic day-trading bot.
Proof_Wing_7716@reddit
What sort of things were you able to do with local ai ten years ago? Must be fascinating to have seen the developments first hand.
rhythmdev@reddit
Do you use pcie 3 or 4 for your rig?
tishaban98@reddit
I would also partially thank the US administration for trying to limit the GPU technology going into china. I don’t think the major Chinese AI labs would’ve been as focused on optimization if they had access to better GPUs
techdevjp@reddit
Yep, hilariously short-sighted action taken by the US on this. The result is hugely beneficial to everyone. (I expect more short-sighted actions from the US such as trying to ban the use of Chinese LLMs by Americans.)
CulturalKing5623@reddit
I set my server up for local LLMs last year and played around with the qwen2.5 coder models. Thought it was neat, but it never performed reliably enough to fit in my workflow.
Came back a few weeks ago to 3.6 MoE with pi.dev on the exact same server and it's my new daily driver. My work is fairly straightforward (data engineer) and it's not enterprise scale. I need a coding partner, not an autonomous agent, so this is the perfect replacement for GitHub's exorbitant new pricing model.
It's a massive leap.
Blues520@reddit
Qwen 2.5 coder was the goat
Veearrsix@reddit
I’ve been shouting from the rooftops that this was going to happen, and will continue.
jedsk@reddit
Let’s go!
CoruNethronX@reddit
I remember manually adding a chat template to the non-instruction-tuned 7b models, like llama, in mid 2023 and getting them to do some work. Fantastic feeling. Then mistral appeared on the stage. Then the epic battle of frankenmerges on the LLM arena... Sometimes I really want to dig through my archived HDDs for the elder top-of-a-month models and compare that experience to something like Qwen3.5 9B.
portmanteaudition@reddit
There's an effort for this from Switzerland right now with the web archive folk
reflectingfortitude@reddit
awesome repository
FuyuNVM@reddit
You mention "some fixes", "specifically a sse-session drop bug and a bug with tool-calling". Are those already merged into club-3090 or wherever they need to go?
RedShiftedTime@reddit (OP)
I didn't PR them :( haven't had time.
threano@reddit
I'm running a single 3090 and I can't get Gemma or Qwen 27b to answer "can cats sense war?" without ollama timing out. Do I just need another card?
urarthur@reddit
what's the benefit of the 2nd card? running fp8?
hurdurdur7@reddit
More vram.
jikilan_@reddit
Context, tensor parallelism
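Roughly what that looks like with an engine like vLLM (just a sketch; the model name is a placeholder for anything that fits in 2x24GB):

```python
# sketch: with 2 cards an engine like vLLM can shard the weights across both
# (tensor parallelism), and the spare VRAM goes to a bigger KV cache / context.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",  # placeholder, anything that fits in 2x24GB
    tensor_parallel_size=2,        # split across both 3090s
    gpu_memory_utilization=0.90,   # leave a little headroom
    max_model_len=65_536,          # longer context is the other big win
)

out = llm.generate(["Explain tensor parallelism in one paragraph."],
                   SamplingParams(max_tokens=200))
print(out[0].outputs[0].text)
```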
redonculous@reddit
Please trickle down to us dual 3060 users next!
joost00719@reddit
I'm running a 5070 ti + 2x 3060 for 40gb vram.
It's not the fastest but with MTP I get like 22-28 t/s on qwen 3.6 27b with full 256k cache
Sear_Oc@reddit
Which motherboard? I'm on a 5070 ti + 4070 super at 28gb vram, would love to add another to the stack. Btw nice one!!
TheOnlyBen2@reddit
The repo he shared is for all Ampere GPUs despite the name. You should be able to pick any proposed configuration, lower the context size and make it work if the base model can fit.
tuura032@reddit
You could probably have AI write you the setup file for the 3060s, once you know which path you select.
theaaronlockhart@reddit
Once dflash support improves (KV-cache quant), it'll be even better. I got up to 160 TPS on dflash with just P2P, but running agents in parallel is more worth it for me.
TheOnlyBen2@reddit
What needs to be better on Dflash ?
theaaronlockhart@reddit
The drafter model for Qwen3.6-27B, which I mainly use, isn’t done training. Also, currently you need FP16 KV cache to use it on VLLM (at least according to club-3090).
sarcasmguy1@reddit
Are you running LM studio on ubuntu or something else?
RedShiftedTime@reddit (OP)
It's in the post what I'm running.
tuura032@reddit
I just set up club-3090 on WSL2 and it's been great! I'm getting around 70 tps in their built in benchmark, power capped at 60%. Might dual boot but I may or may not get around to it.
RedShiftedTime@reddit (OP)
43% faster on Linux!
sleepnow@reddit
Is the box still usable for other tasks as it's doing inference, or does it need to be dedicated?
jikilan_@reddit
Just run an Ubuntu desktop, you can work/play while doing inference. The condition is you must have an iGPU and max out your RAM.
sleepnow@reddit
I run Fedora as my primary OS. I'd like to get a second 3090 so that I can make productive use of local LLMs. I've been waiting until we were 'there', and it sounds like we finally are with these models.
jikilan_@reddit
I want to point out that… in my opinion… yes, they're big improvements, but we are not there yet. I still need to check the free Gemini or ChatGPT from time to time when I doubt the response from a local model.
Local model development isn't healthy either. Models don't get updated often, and it's just a few big players. We GPU-poor have limited choice.
sleepnow@reddit
I was thinking along the lines of replacing cloud models with local ones for several of the agents in my crew, but I'd still use the cloud models for the heavy lifting. My use case is largely coding and research; my current harness is opencode + OAI, though I've been experimenting with Hermes for general automation and system-level stuff.
There's something really satisfying about running LLMs locally. If adding a second card would allow me to run a model that I can be productive with rather than just a toy that 'sort-of' works (which has been my experience with a single 3090) then I'd go for it.
RedShiftedTime@reddit (OP)
GPUs are hammered, but it's otherwise responsive. This rig also has a 14900KS and 96 GB of DDR5 that's only holding the mmap. Not really using it for anything else.
munkiemagik@reddit
As long as you aren't taxing the gpus with any other workloads that could suddenly eat up VRAM your inference engine was expecting to use, do whatever else you want. It's not like inference needs much from your CPU or system RAM (assuming everything is being done in VRAM), leaving them free for other tasks.
I'm currently pondering switching my GPU 'server' over from ubuntu desktop to the proxmox hypervisor and putting the LLM functions into an LXC.
At first I thought I wanted a linux desktop to hand, plus it was easier, being new to linux, to set up/configure/experiment with everything within a desktop environment. But now for linux I make do with VMs on my other proxmox nodes, and for the LLMs I have a good enough handle on things that the desktop environment is no longer a necessity.
The only time I really use the ubuntu desktop anymore is when I'm being lazy with setting up config.yaml for llama-swap for newly downloaded ggufs. It's just convenient and easy to mouse-click copy/paste multiple pwd outputs and long-winded gguf filenames from the terminal into a gui editor.
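Honestly the lazy fix is probably a ten-line script that globs the gguf folder and spits out config stubs to paste in, something like the sketch below (heads up: the models:/cmd: key names are from memory, double-check them against the llama-swap README):

```python
# sketch: scan the models dir and print llama-swap style entries for config.yaml.
# NOTE: the "models:"/"cmd:" layout is an assumption from memory, verify against the llama-swap README.
from pathlib import Path

MODELS_DIR = Path("/models")  # wherever the ggufs land (placeholder)
PORT = 9001                   # placeholder port for llama-server

print("models:")
for gguf in sorted(MODELS_DIR.glob("*.gguf")):
    name = gguf.stem.lower()
    print(f'  "{name}":')
    print(f"    cmd: llama-server --port {PORT} -m {gguf} -ngl 99 -c 32768")
```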
j_osb@reddit
Do note that their cards are powerlimited. Linux will be faster for CUDA compute, yes, but not by a margin of 43% vs WSL2 for this workload.
RedShiftedTime@reddit (OP)
Mine are also power limited! 250 for the air cooled and 275 for the liquid cooled. I think they are usually 420 and 450? Only lost 8 tk/s!
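For anyone wondering how to cap them: `nvidia-smi -pl <watts>` does it, or you can script it through NVML. Rough sketch below (needs the nvidia-ml-py package, and root to actually change the limit; the 250 W value is just an example, not a recommendation):

```python
# sketch: read (and optionally set) per-GPU power limits via NVML.
# needs `pip install nvidia-ml-py`; changing the limit needs root.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    cur_w = pynvml.nvmlDeviceGetPowerManagementLimit(h) / 1000  # NVML reports milliwatts
    print(f"GPU {i}: current power limit {cur_w:.0f} W")
    # uncomment to apply a cap (example value, requires root):
    # pynvml.nvmlDeviceSetPowerManagementLimit(h, 250_000)  # 250 W in milliwatts
pynvml.nvmlShutdown()
```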
tuura032@reddit
The real solution is a dedicated system for it. I need a rack first 😅
rhythmdev@reddit
Try 4 x 3090’s next it will blow you away
Worldly-Entrance-948@reddit
Local models finally feeling *actually* useful instead of just impressive is such a huge shift, and honestly the fact that a dual 3090 box can already do real coding work says a lot about where this is headed.
MisticRain69@reddit
qwen 3.6 27b Q8 is crazy good, esp if using f16 KV cache. first small model i've used that is this consistent, it just chews through all the web dev I throw at it
IrisColt@reddit
Two RTX 3090s would be my sweet spot, so yes.
Tough_Frame4022@reddit
I'm running 1 million tokens of context on a 3090, with several models and types, and no context rot.
Automatic-Arm8153@reddit
Impossible to have no context rot.
do_u_think_im_spooky@reddit
Taking inspiration from that repo and will set up something similar, but for 5060 Tis.
TheManicProgrammer@reddit
Do keep me updated!
PlusLoquat1482@reddit
we are so back lol
2x3090 being a legit local AI setup is still funny to me. Like yes it’s cursed, hot, power hungry, and held together by Linux pain, but also… it works?
The big shift is that local doesn’t feel like a toy anymore. It’s not always frontier-model good, but for repo work, patching, review, shell/tool loops, etc. it’s getting useful fast.
fiddlerwoaroof@reddit
I’ve been pretty happy with my 1xMI100 setup
portmanteaudition@reddit
People are going to be very disappointed when the M5 Ultra is not available for over a year with that much memory.
Icy-Pay7479@reddit
It's great for agents, but for coding it still feels meh to me. But I'm using codex to plan/review on a $20 plan and the combo gets a lot of mileage so far.
xilvar@reddit
3 years ago I predicted a 2.5x year-over-year improvement in cost at the same quality and speed, with the corollary of being able to trade those things off against each other.
In general that has borne out.
BringMeTheBoreWorms@reddit
Whoa! Step aside for mr Nostradamus over here
Woof9000@reddit
wow, impressive prediction, do you want an award? but you gonna have to share your award with a 1000 other guys tirelessly predicting obvious things here, daily.
Fabulous_Fact_606@reddit
Got a 2x3090 ubuntu box at 100% GPU utilization in the garage serving API calls. Ditch the dual boot.