What starts to become possible with two 3090s that wasn't with just one?
Posted by GotHereLateNameTaken@reddit | LocalLLaMA | 83 comments
qwen 3.6 has been working great and has got me wondering.
Electronic-Space-736@reddit
same architecture, so you can split a model across both with Ollama. Still not as good as a single card, but double the VRAM headroom, so a model twice as big, just slower.
jwpbe@reddit
this is completely wrong
Electronic-Space-736@reddit
I am not sure how this is wrong, I am doing this currently
jwpbe@reddit
If you have two graphics cards and you're using ollama it's like having a motorcycle with bicycle training wheels
if you used literally any other legitimate backend and not ollama you would understand how untrue this is. having tensor parallel enabled absolutely speeds up inference
Electronic-Space-736@reddit
also, I think your rage is mostly because you don't like ollama?
I use it, but I only use it via API, it is convenient for me.
You should not downvote out of prejudice against something unrelated to the comment being made.
FatheredPuma81@reddit
No Ollama is just objectively bad. Literally every important setting is hidden from you and you just encountered one of them ffs.
All you need to do to get like a 10x better experience is download and unzip llama.cpp, open your cmd/terminal in that folder, and type "llama-server -m (drag and drop your model here)".
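Something like this, for example (the model filename is just whatever GGUF you grabbed; the flags are a sane starting point, not gospel):

```bash
# Serve a local GGUF model at http://localhost:8080 with a built-in web UI.
# -ngl 99 offloads all layers to the GPU; -c sets the context window size.
llama-server -m ./qwen-27b-q4_k_m.gguf -ngl 99 -c 32768 --port 8080
```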
Electronic-Space-736@reddit
fair enough, I do not notice these things as I have AI deploy and configure, I just have to remember to avoid the "ollama" keyword, it seems to trigger folk
Electronic-Space-736@reddit
we were not talking about speed (I did mention slower), but yes, a model twice the size will run, just slower - exactly what I said originally
FatheredPuma81@reddit
Ollama = bot comment btw.
Orlandocollins@reddit
There is this weird scaling factor of 4 I have noticed. One 24GB card opens doors. After that you need 4 to really have new options. From there you're basically at an RTX Pro 6000, and the next jump is again 4 RTX Pro 6000s.
Like don't get me wrong, there is always a new model or quant at small VRAM bumps, but the factor of 4 is where the real change shows.
Tired__Dev@reddit
It’s just cheaper to get an RTX 6000 at that point. 3090s were cheap last summer; now they’re absolutely stupid prices in Canada.
GotHereLateNameTaken@reddit (OP)
RTX 6000s appear to be around 8k USD and used 3090s seem to be around 1k USD. It seems like a reasonable middle step to get a second 3090, given the almost order-of-magnitude step up in price.
Orlandocollins@reddit
I get what you are saying, but the number of times I have seen people get up to 4 3090s is pretty high. And I would much rather have one card that I can put in a PCIe slot at 600 watts than deal with the problems of trying to run 4. Four are still cheaper than one RTX Pro, but you can slot the RTX Pro into hardware more easily, and the speed boost is pretty significant.
sleepy_roger@reddit
It's still a massive cost difference. Even with a Threadripper, 256GB DDR4, six 3090s, 2 PSUs, PCIe extenders, etc., I came in 2k under a single RTX 6000, which are around 9300 currently.
kaliku@reddit
I factored in the time and hassle to put everything together, including research, and also the hassle of keeping it running. Multiple failure points. Multi-GPU configs. Lower bandwidth, with lower speed for prefill and token generation, but especially prefill. Power usage. Dust. Cat. Kids.
I went with a 6000.
Part of me wishes I had a monster rig if only as a conversation starter and geek swag 😂
GotHereLateNameTaken@reddit (OP)
Yeah so one 3090, then two, then a 6000, and the first two still have resale value.
EbbNorth7735@reddit
I ran with a 3090 and 4090 for a long time. It was great. With agentic work you want lots of context space, which 48GB gives you for dense models. The difference between 2k Canadian and 10k is realistically a lot. 6000s are up to 13k CAD as well; a couple months ago you could get one for around 10k. At 3x the VRAM for 2x the price of a 5090, it was actually a decent deal. 96GB or 4x 3090 unlocks ~100B MoE models, but does that give you anything more? Technically not very much. Qwen 3.5 122B is roughly equal to the 27B model. So yes, two 3090s can run a 27B dense fast, and you're only missing the speed of the 122B MoE. At 2k for two 3090s it really is a lot of value.
UnethicalExperiments@reddit
I've been getting RTX 3060s when people aren't asking 400+ for used mining cards and refusing to budge because enough idiots are paying that. (There's a seller in London, Ont. offering a crazy used mining setup at a premium.)
Any other card with a decent amount of VRAM is going to be at least 650-800 here in Canada. Most of these "buy this expensive ass card instead of smaller units" solutions simply don't take into account how much getting hardware can suck outside of the US.
3090s - 1200-1500 if you can find them
4090s - 2500+
5090s - 5k
FullOf_Bad_Ideas@reddit
That's just not true. You unlock new options with each card.
Electronic-Space-736@reddit
You are honestly better off going with something like this: https://hilbert-agentic-computer.kckb.me/b06cccc2
Orlandocollins@reddit
So I guess what I am getting at is that if you want to run locally and you are currently at 24GB and want to move up, make a plan to get to 96GB.
EbbNorth7735@reddit
Hold up, 2x24GB gives you 48, which is perfect for high-context ~30B dense models, possibly with a bit of room to spare for TTS and STT models. 24GB struggles to hit high context.
Electronic-Space-736@reddit
I got downvoted for saying this
tehinterwebs56@reddit
Correct. Also, KV cache is king in AI: having a 35B model across two GPUs means you have sooo much context, which makes life easier with long windows or with an agent like openclaw.
I have 2x P40s running openclaw with qwen3.6:35b, and I can also run other smaller models for the main agent to invoke when required.
Pretty good to have two; I feel like if I only had one, I'd be a bit hamstrung.
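If you're doing the two-GPU long-context thing with llama.cpp as the backend, it's roughly this (a sketch; the model file and split ratio are just illustrative):

```bash
# Split the model evenly across both GPUs and open up a long context window.
# --tensor-split 1,1 divides the layers 50/50 between the two cards.
llama-server -m ./qwen3.6-35b-q4_k_m.gguf -ngl 99 --tensor-split 1,1 -c 131072
```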
FatheredPuma81@reddit
What models can you run at a factor of 4 that you can't at a factor of 3? All I can really find is Qwen3.5 122B tbh.
Pedalnomica@reddit
I think being able to go up to ~8 bit with decent context for the ~27-35B models is the biggest step up. I don't always have the greatest luck with the 4-bit quants.
Not to mention just having a little more headroom for, e.g. TTS, embedding...
GotHereLateNameTaken@reddit (OP)
Is this just because of model sizes, or because of some hardware limitation in sharing a model over a number of cards that isn't 4?
cakemates@reddit
model size, but I mean qwen 3.6 works a lot better with two 3090s; you can run big context and so on. I haven't explored the limits yet. You can also run some smaller model at the same time, if you want audio for example.
MK_L@reddit
So on a 5090 I set the context to max (256k) and it was fine. I would expect 2x 3090s to do better.
FatheredPuma81@reddit
Basically more parallel agents and higher context is all you really get. At 3 RTX 3090s you can run Qwen3.5 122B UD-IQ4_NL and at 4 RTX 3090s you can run Minimax M2.7 230B UD-IQ3_XXS.
FormalAd7367@reddit
what’s your experience running Minimax 230B on a quad-3090 setup? Mine has not been great. When I squashed a 230B MoE model down to under 3 bits per parameter, it lost its nuance, its reasoning got muddy, and it started making basic logic errors.
FatheredPuma81@reddit
I don't have 4 3090s lol, I just looked at what models would fit in his VRAM. But my experience with the M2.5 Reap 130B at Q3_K_XL wasn't great; M2.7 at IQ1_M was fine enough for a Flappy Bird one-shot, and I've heard someone say they run IQ2_XXS and are happy with it in OpenCode.
sleepy_roger@reddit
I needed 6 to make minimax worthwhile.
AdventurousSwim1312@reddit
Basically more speed; you'll get twice as many tokens, which is a nice perk for agentic use cases.
perkia@reddit
Tripping a breaker
ProfessionalSpend589@reddit
Another US joke I don’t get.
perkia@reddit
I'm in Europe ;) Older GPUs have huge power usage. My overclocked 5090 Mobile (!) needs a third of the 350W a 3090 requires for basically the same performance.
Equivalent-Freedom92@reddit
European outlets won't cap out until around 3.5kW.
HadHands@reddit
Very old installations may have a 10A breaker, but that's still 2.3kW and would be fine for most home setups, and for all laptops. 3.7kW is standard where I live.
__some__guy@reddit
Not much.
70B models aren't really made anymore.
A single 3090 is either enough, or you need 3-4 of them for the next size of models.
silenceimpaired@reddit
Not completely true; there was one released in the last six months, fine-tunes are still being made, and for future readers, depending on what you're doing, older 70B dense models still pack a punch at much higher speeds on 48GB VRAM. A 70B can behave similarly to a MoE model that requires 128GB.
That said, they are rare… and if you want agentic or coding work, the commenter is right.
__some__guy@reddit
That's why I said "not much", rather than "nothing".
You can also run the new 100B+ models with 2/3-bit quants, however the 48GB VRAM range clearly isn't a sweet spot anymore.
silenceimpaired@reddit
I can agree with that.
munkiemagik@reddit
Here's a new one I learnt (apologies, it's not LLM related): marginal stability. Just because something looks solid and stable and passes all stress tests, that only holds for the exact current test conditions, current physical environment included. It doesn't mean it would be stable under new environmental conditions, because that tested stability is on the edge of being stable.
I have a Proxmox node that sits in the rack just above the LLM server, and I just figured out that the heat from the GPUs has highlighted a problem with my PVE node's VRM that I wasn't aware of!
kidflashonnikes@reddit
In Europe that’s illegal - in order to be compliant, you must let the gov know in advance when you use anything more than 100 watts - especially for AI. It’s very bad for the environment and anti-migration. Just reported you to the Minister of Truth.
Prudent-Ad4509@reddit
You can use tensor-parallel inference, especially if you snatch an NVLink bridge for it. You can use one such box to run a competent small model like Qwen3.6 35B with a good quant and delegate most tasks to it instead of calling cloud models.
Having 4 is way better of course, but having 2 already means being able to get work done without resorting to low quants when using small models up to 35B. However, you still need to use low quants of models like 122B.
The next step is when you can use high quants of 122B but still have to use a low quant of 200-400B models. That would require 4 to 8 GPUs, depending on your situation, and a very different motherboard/CPU combo. Just two is simple enough.
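A rough sketch of that setup (the model id here is just a stand-in for whatever you actually run; NVLink shows up as NV# entries in the topology matrix):

```bash
# Check how the two cards are connected (NV# = NVLink, PHB/PIX = plain PCIe).
nvidia-smi topo -m

# Serve one model across both cards with tensor parallelism in vLLM.
vllm serve Qwen/Qwen2.5-32B-Instruct-AWQ --tensor-parallel-size 2 --max-model-len 32768
```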
entsnack@reddit
You can teach yourself distributed ML using the HF ultrascale playbook! Learn about designing collectives and optimizing distributed training and inference workloads.
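And if you want to poke at the mechanics yourself on a 2x3090 box, the minimal data-parallel launch is a one-liner (train.py being your own script):

```bash
# Launch one copy of the training script per local GPU with PyTorch DDP.
torchrun --nproc_per_node=2 train.py
```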
swingbear@reddit
I went full retard and bought 2 6000's. I guess the answer is the same here though: bigger models and much faster smaller models.
FullOf_Bad_Ideas@reddit
You can run Devstral 2 123B at low bpw, or GLM 4.5 Air / Qwen 3.5 122B A10B with more of the model in VRAM. When I had two 3090 Tis, my main model was GLM 4.5 Air 3.14bpw fully in VRAM.
Freonr2@reddit
Tensor parallel (i.e. vllm) speedup is a genuine near-2x decode rate for dense models. While still not as fast as an A3B, it feels significantly better for interactive use or bulk jobs with concurrency. I don't think you should underestimate that. Speed matters when you are actually trying to do something useful. There's no trade-off besides the cost of the extra card here; it's just a lot faster.
Note that 2x3090s has as much memory bandwidth as a 5090 or RTX 6000 Blackwell. It won't actually be quite as fast, particularly for prefill, but it's certainly at least fast. I was actually very impressed when I was comparing 2x3090 to 1xRTX 6000 Blackwell myself.
48GB will allow you to push context size of ~30B models and/or concurrency.
48GB allows more concurrency for bulk/agentic tasks.
48GB allows larger quants.
Outside LLMs, 2x3090 also means you can skip a lot of VRAM hacks for some of the larger diffusion models, assuming you know how to assign model parts to different GPUs which admittedly isn't always easy.
I assume what you're getting at is that there are no particularly great models sized in a way where 48GB vs 24GB is some huge unlock, since there are many good ~27-32B models out there. That's partially true, ignoring context length, concurrency, TP speed, and quant choice.
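If you want to measure the split speedup on your own pair, llama.cpp's bench tool can compare the split modes directly (a sketch; the model file is illustrative, and row split is llama.cpp's closest analogue to tensor parallel):

```bash
# Layer split (default): each GPU holds half the layers and works in sequence.
llama-bench -m ./model-q4_k_m.gguf -ngl 99 -sm layer

# Row split: tensors are divided across both GPUs so they compute in parallel.
llama-bench -m ./model-q4_k_m.gguf -ngl 99 -sm row
```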
jwpbe@reddit
you can run models with vllm / ik_llama with tensor parallel, so you can use the compute of both cards to speed up inference.
you can run bigger models generally; with 64GB of DRAM and the two cards you can run something like stepfun 3.5 flash at 15 tokens per second decode with mid-100s tokens per second prompt processing.
i get 50 tokens per second decode on Qwen 27B instead of low 20s
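for the big-MoE-on-48GB trick, the usual incantation parks the expert weights in system RAM (a sketch; the filename and the layer count are illustrative and model-dependent):

```bash
# Keep all layers on GPU except the MoE expert weights of the first 30 blocks,
# which stay in system RAM; attention and KV cache remain on the cards.
llama-server -m ./big-moe-q4_k_m.gguf -ngl 99 --n-cpu-moe 30 -c 32768
```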
AppealSame4367@reddit
I am a bit surprised. I would expect two 3090s to run at 60+ easily. Do you have more than 50 tps at low context?
jwpbe@reddit
i power limit the cards because the performance per watt falls off a cliff quickly
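it's a one-liner per card (250W here is just where I'd start on a 3090; tune to taste):

```bash
# Cap each 3090 at 250W instead of the stock 350W, trading a modest
# throughput hit for a much cooler and quieter box.
sudo nvidia-smi -i 0 -pl 250
sudo nvidia-smi -i 1 -pl 250
```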
AppealSame4367@reddit
Oh, ok. Do you use ngram speculative decoding?
truedima@reddit
I'm likely in a similar boat to GP; 2x3090s at 300W max with 27B and 150k ctx, usually on vllm tp=2. Gen is often low-ish, around 20-30. Haven't benchmarked shorter ctx for my use cases. So the biggest value for me atm is indeed moar ctx. Or a decent quant on one card and another model on the other. But it's def not some magical land of always 60 tps gen for me so far.
jwpbe@reddit
give ik-llama a try? only 20-30 seems off
truedima@reddit
ik-llama I haven't tried yet, but exllama was just core-dumping for me a few weeks back, so I called it a day. Otoh, 20-30 at large ctx seemed not too far off based on what I sometimes read. Do you get much higher on 2x3090 at 150k+ ctx? (I use cyankiwi's AWQ-int4 quant.)
jwpbe@reddit
exllama doesn't have good support for the new qwen / gemma yet
i have only run the RYS Qwen 27B, which is a little bigger, so I have only been using the Q5_K_M GGUF
GotHereLateNameTaken@reddit (OP)
Is this on two 3090s? Which quant size for the 27b model?
jwpbe@reddit
It's actually RYS Qwen 27B so it's a little bigger than normal. It's Q5_K_M. I can run full context with the vision component. Only reason I don't use vllm is nobody has made quants for it.
Individual_Spread132@reddit
Gemma 4 31B Q8_0 fits in perfectly, at 80 000 context size.
lemondrops9@reddit
With two you can use full context with Qwen3.6 35B. It's quite nice. Still unsure if it's better than Qwen3.5 122B.
cm8t@reddit
Larger quants of models you’d be able to run anyways. And tensor parallelism buys you quicker pre-processing/inference.
GotHereLateNameTaken@reddit (OP)
The discussion so far is helpful. So with two 3090s, and thus 48GB of VRAM, you can just load a single model across both? Or is there some catch?
If so, it seems like you would be able to run the 27B model at Q8, which seems like a significant jump up, in the case that 3.6 27B makes similar progress as the 35B MoE model did. Is anyone doing this now with the 3.5 27B? What kind of speeds do you get?
unoriginalwhitekid@reddit
You can absolutely run a model across 2 GPUs. Using LM Studio, you will get pretty much the same speed as if you only had one. With 48GB you can run 27B at Q8, but anything above Q6 or even Q5 is a bit of a waste. You can find graphs showing the degradation at each quant level, and you'll find that it only really starts showing up at Q4.
jikilan_@reddit
Got any source for that claim? You make it sound like those running Q8 are doing something unnecessary.
jwpbe@reddit
You need to get an inference backend like ik_llama or vllm, or even just normal llama.cpp to take advantage of the hardware with tensor parallel. It's not 'twice as fast' in every case, but it will be much faster.
That's the catch: you need to use your brain and interact with a terminal. People saying 'it's the same speed' have no idea what they're talking about, or they're bots.
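With plain llama.cpp the relevant switch is the split mode (a sketch; the model path is illustrative):

```bash
# -sm row splits tensors across both GPUs so they compute each layer together,
# instead of the default layer split where one card waits on the other.
llama-server -m ./model-q4_k_m.gguf -ngl 99 -sm row -c 32768
```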
TheOnlyBen2@reddit
Tensor splitting has recently been merged into the llama.cpp main branch. However, no graph mode yet.
Alarming-Ad8154@reddit
Yes, some tools run "tensor parallel": the same model is on both cards and it's (almost) twice as fast. It's great for the dense qwen3.5 27b model.
No-Manufacturer-3315@reddit
Run qwen3.6 at Q8 with the full 256k cache. That's what you get with two cards! Lots of cache and full Q8.
psyclik@reddit
Obviously bigger models, longer context, or less heavy quantization.
Past the obvious: better performance with vLLM TP or ik graph-parallel, and more parallel requests (don't fret over that if you go local agents).
And then, the ability to run multiple models in parallel: LLM on one GPU, ASR/TTS on another, or diffusion, or…
I run 4x3090 on my « master » server, but rarely run one big LLM split across them. It's usually one LLM on one or two GPUs, and whatever my agents require (z-image, ltx, ASR or TTS) on the others, with llama-swap as the frontend.
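Without llama-swap, the bare-bones version of the multi-model setup is just pinning each server process to a card (a sketch; models and ports are illustrative):

```bash
# Main LLM on GPU 0, a smaller utility model (TTS/ASR glue, etc.) on GPU 1.
CUDA_VISIBLE_DEVICES=0 llama-server -m ./main-llm-q4_k_m.gguf -ngl 99 --port 8080 &
CUDA_VISIBLE_DEVICES=1 llama-server -m ./utility-q4_k_m.gguf -ngl 99 --port 8081 &
```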
cviperr33@reddit
Well, right now nothing; the 35B model fits perfectly on just 1 RTX 3090 with max context, so going to 2 would be pointless. Maybe if you want to run 2 agents in parallel, that would be cool.
When (if) they release the bigger 130B model, that's when it becomes possible for you to load it and impossible for us with just 1. With the new llama.cpp updates that let you load some experts to system RAM without sacrificing much performance, I think it would be possible to load it and have it at a decent 70-100 tk/s for agentic coding.
TurboFucker69@reddit
Even in that case you could run that 35b model at Q8 instead of Q4 with two 3090s. In my experience it’s a notable improvement, but not a dramatic one.
cviperr33@reddit
going above Q4 for model quants (not the ctk/ctv) imo is a huuuge waste; it's way slower and you can't fit a nice 200k context (at least on 24GB VRAM cards). I have not ever noticed a quality difference between Q4 and Q8, and even if there is, speed compensates for it (unless you use it just for chatting and don't mind waiting).
If I had 2 3090s, sure, I would consider Q8, because what else can I do with that VRAM lol. But probably I would end up just running 2x Q4 so I can have them work together and cross-check each other's code.
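(The ctk/ctv bit, for anyone wondering: llama.cpp can quantize the KV cache separately from the weights, which is how the 200k context fits. A sketch with illustrative paths; quantizing the V cache may require flash attention enabled on your build:)

```bash
# Q4 weights plus a q8_0-quantized KV cache: -ctk/-ctv set the cache types,
# roughly halving KV memory vs f16 so a huge context fits on the card(s).
llama-server -m ./model-q4_k_m.gguf -ngl 99 -c 200000 -ctk q8_0 -ctv q8_0
```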
TurboFucker69@reddit
That was more or less my point (though I didn’t say it was a good point, lol).
I’ve noticed a difference in how well research agents follow instructions and perform their tasks, which seems to suffer with more quantization.
I don’t do a lot of vibe coding so I can’t speak to how much of a difference it makes there. I’d imagine coding-specific models would do better at coding when quantized than general models would under the same conditions, but that’s just conjecture based on how relationships are encoded in an LLM’s parameters.
Also, FWIW, my inference speed at Q8 is about 70-80% of Q4. I’m running it on Apple Silicon (M5 Pro) with 64 GB of unified memory. It’s a bit slower, but having to deal with agents not performing well takes time too.
cviperr33@reddit
Yeah, for vibe coding it's mostly turn-based; you see what's happening in real time and you instantly notice if something is wrong. Also, broken code won't compile, so it doesn't matter much if the quality is slightly lower; faster tk/s means faster iteration, and that's more valuable.
For research though, if you don't want it to fail 100 times out of 100, yeah, definitely the Q8 quant. And you have a crazy Mac with so much RAM; of course you would load it at full accuracy :D Soon, in June, new Macs could overtake the 3090s.
jwpbe@reddit
If I run Q6, I'll get 20% fewer tokens per second, but the model only needs one or two shots instead of three or four, plus the stops and starts you get with a Q4 quant. Then the iteration isn't faster; it's over twice as slow.
The degradation adds up over the course of a context window pretty rapidly.
alexp702@reddit
Tool calls noticeably fail more with Q4 compared to Q8. This ruins agentic flows. You can also see the difference in image processing quite starkly. A good Q8 is my personal quality floor.
TurboFucker69@reddit
Yeah, this was the biggest difference I noticed between Q8 and Q4.
tecneeq@reddit
Running it with full context and higher precision.
illcuontheotherside@reddit
As someone who went from 1 3090 to 2... It was well worth it. But everyone's different and if you are satisfied with 1 then be satisfied.
For me I wanted the extra vram.. larger context windows.. larger models.. more experimenting.
I think if anything I'd double it.
From 2x24GB to 1x48GB or 1x96GB to scale out, but obviously price points are blocking that.
End of the day if you're learning and having fun.. enjoy the ride!
MK_L@reddit
Are you using SLI/NVLink?
wind_dude@reddit
Training in parallel