Best Model for single 3090 in 2026?
Posted by myusuf3@reddit | LocalLLaMA | View on Reddit | 85 comments
Running a single RTX 3090 (24GB VRAM) and looking for the best overall model in 2026 for coding + reasoning.
Main priorities:
- Strong code generation (Go/TypeScript)
- Good reasoning depth
- Runs comfortably in 24GB (quantized is fine)
- Decent latency on local inference
What are you all running on a single 3090 right now? Qwen? DeepSeek? Something else? Would love specific model names + quant setups.
fasti-au@reddit
Qwen 35B at Q6, plus a 27B for the architect; swapping them in and out is worth it.
bolarius_art@reddit
I just set up an auto-switching interface in VS Code that uses DeepSeek locally and can dynamically call on CC for heavier workloads.
The goal is to run the local model for admin and research overnight, and use CC during the day for dev tasks.
Still very new and figuring it out, but the file-architecture stuff so far has been hella interesting.
durden111111@reddit
How much RAM do you have? If 96GB+, then just download the largest MoE that will fit in it and load it with llama.cpp.
OkBase5453@reddit
Hey, I just wanted to ask for pointers. I have an RTX 3090 + 2x Xeon E5-2696 v4 + 512GB DDR4-2400 RAM. I just don't know where to start; I only need it for scripting and reading large user manuals. Just got the GPU; previously, running CPU-only, I got this:
root@llama-cpp:~# numactl --interleave=all /opt/ik_llama.cpp/build/bin/llama-bench -m /mnt/nvme-llm/models/Qwen3.5-27B.Q4_K_M.gguf -t 65,78 -p 1024 --mmap 0
======================================= HAVE_FANCY_SIMD is NOT defined
| model | size | params | backend | threads | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ---: | ------------: | ---------------: |
| qwen35 27B Q4_K - Medium | 15.39 GiB | 26.90 B | CPU | 65 | 0 | pp1024 | 42.88 ± 0.86 |
| qwen35 27B Q4_K - Medium | 15.39 GiB | 26.90 B | CPU | 65 | 0 | tg128 | 4.80 ± 0.09 |
| qwen35 27B Q4_K - Medium | 15.39 GiB | 26.90 B | CPU | 78 | 0 | pp1024 | 48.74 ± 0.30 |
| qwen35 27B Q4_K - Medium | 15.39 GiB | 26.90 B | CPU | 78 | 0 | tg128 | 5.23 ± 0.07 |
build: 1a7aa3e7 (4323)
OmarasaurusRex@reddit
I just got the qwen3 coder next 80b working on my 3090 after someone recently posted that the ud-iq3 variant is super smart
It's really awesome
Qwen3-Coder-Next-UD-IQ3_XXS.gguf
/app/llama-server --port ${PORT} -hf unsloth/Qwen3-Coder-Next-GGUF:UD-IQ3_XXS --fit on --main-gpu 0 --flash-attn on --ctx-size 32768 --cache-type-k q4_1 --cache-type-v q4_1 -np 1 --jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --repeat-penalty 1.0 --metrics
This setup appears to use about 10GB of system RAM.
Approximate speeds on quick tests:
Performance test results:
| Metric | Value |
| --- | --- |
| Prompt tokens | 511 |
| Completion tokens | 1,470 |
| Total tokens | 1,981 |
| Prompt speed | 293.5 t/s |
| Generation speed | 29.5 t/s |
| Wall time | 51.6 s |
| Finish reason | stop (natural) |
Judge_OnReddit@reddit
I tried this too and pulled the same numbers until the context window grew. My CPU is an Intel 7900X (a space heater pretending to be a CPU) and the tokens/s tank for me; it took almost 20 minutes to process a 3,500-token prompt. Switched to Qwen3.5-35B-A3B-UD-Q3_K_XL - it fits in VRAM, can hold 3 parallel KV caches at 262k context size, and seems to be doing well at coding so far... fingers crossed.
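Something along these lines should reproduce the multi-slot setup; it's only a sketch, since the KV cache quant and whether everything actually fits in 24GB depend on the model's KV layout:
./llama-server -m Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf -ngl 99 -np 3 --ctx-size 786432 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --jinja
Note that --ctx-size is the total context shared across the -np slots, so 786432 / 3 gives each of the 3 slots the 262144 tokens mentioned above.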
Insomniac24x7@reddit
Holy crap, I just tried this, this is incredible. Thank you!!
albertuki00@reddit
Which model and parameters (i.e. context window) would be good to use with a single RTX 3090, Ollama as the local LLM provider, and the Claude agent set to use local models? I've used GLM 4.7 Flash, but it sometimes gets stuck, times out, or gives weird results.
lmagusbr@reddit
there isn't one
myusuf3@reddit (OP)
What is everyone running here? $10K USD setups are a bit ridiculous for such a fast-moving space.
Prudent-Nebula-3239@reddit
$10K is a lot and even that doesn't get you far; realistically you'll need/want a $30-$70K setup, and it'll depreciate hard. Best to wait a few more years before you spend serious money on AI hardware.
AlwaysLateToThaParty@reddit
My experience recently is the opposite of this. Infrastructure that I've acquired over the last few years has increased in price.
Prudent-Nebula-3239@reddit
Because it’s scarce right now. That’s why prices are high.
There’s a record-scale AI data center arms race happening globally. Hyperscalers and governments are locking up GPUs, HBM, advanced packaging, power capacity, the whole supply chain. That’s billions pouring in all at once. Supply can’t instantly match that.
You can imagine what that kind of capital does over time. Tech improves fast, manufacturing scales, efficiency jumps. Same pattern we saw with lithium-ion batteries in the early Tesla days. Early hardware looked scarce and expensive, then scale and competition drove rapid improvement.
Same with used cars during COVID. New production stalled, used prices spiked. Once production resumed, prices cooled.
When fabs ramp, packaging expands, and the next generation makes today’s hardware less efficient per dollar and per watt, prices normalize. Hardware doesn’t escape depreciation just because we’re in a hype cycle.
AlwaysLateToThaParty@reddit
You're going to need a few more data points than 'trust me bro' when the evidence of reality directly contradicts your assertions.
Prudent-Nebula-3239@reddit
Please enlighten me
AlwaysLateToThaParty@reddit
You're the one making the assertions fella. I'm pointing at evidence that runs contrary to your opinion, so what are your assertions other than "just you wait"? That's the way critical thinking works.
Prudent-Nebula-3239@reddit
No you're just wasting my time
AlwaysLateToThaParty@reddit
Critical thinking is hard.
Prudent-Nebula-3239@reddit
For you it is, apparently. You think that after all the new datacenters go online in the next year or two, your 0.00001% of "infrastructure" won't go down in price like everything else on this earth? I gave you two excellent examples from the last 6 years, so what more do you want?
If you wanna argue your irrelevant opinion then at least come with something
AlwaysLateToThaParty@reddit
Imaginary thinking because you want to believe something, ignoring evidence because it doesn't suit your desires.
Prudent-Nebula-3239@reddit
Ok now I know you're just ragebaiting lol
Gtfoh
AlwaysLateToThaParty@reddit
Even when something is explained to you in detail, your conditioning makes you reject it. Consume less social media. It will materially benefit your life.
Prudent-Nebula-3239@reddit
You haven't explained anything though... nice try Diddy, get a life
AlwaysLateToThaParty@reddit
Words are hard.
Prudent-Nebula-3239@reddit
Damn you must be fun at parties lol
eightysixmonkeys@reddit
Imagine the AI bottleneck just becomes TSMC
blbd@reddit
Unified memory. Or Claude Code / Codex subscriptions.
fulgencio_batista@reddit
Is there a way to get unified memory without Apple?
braydon125@reddit
Nvidia jetson!!!
ZioRob2410@reddit
I have a chance to buy an orin agx for 2k usd more or less. Have you tried that?
braydon125@reddit
My cluster has two 64gb dev kits
ZioRob2410@reddit
How many tps ? And which models are you running on those ?
braydon125@reddit
Dm and I'll respond after work
Polymorphic-X@reddit
NVIDIA DGX Spark or AMD Ryzen AI Max+ 395 are non-Apple options for unified memory.
Ryanmonroe82@reddit
Have you used a Spark? It's very slow. Wouldn't advise it at the moment for LLMs.
fulgencio_batista@reddit
gawd damn i wish i was rich 🙏
blbd@reddit
DGX Spark and AMD Strix Halo aka Ryzen AI Max.
Pvt_Twinkietoes@reddit
Intel is working on them, but I'm not sure when we'll actually see that on the market.
CaterpillarPrevious2@reddit
No for subscriptions! Local is the king!
Insomniac24x7@reddit
I'm running a 3090 on an R9 7850X with llama.cpp and Llama-3.3-70B-Instruct-Q4_K_M.gguf, and unfortunately performance was abysmal: 3-4 tokens/s.
overand@reddit
Yeah, try that at like 2 bits - maybe the Unsloth one at UD-IQ2_XXS; it'll fit in your VRAM.
If you want to use a model that doesn't fit in your VRAM, you'll do best with a Mixture of Experts. Try GPT-OSS-20B and even 120B; you will probably be surprised by the performance of the 120B on a 3090! I was running that on a Ryzen 5 3600 system with DDR4 RAM and one 3090, and it was surprisingly decent.
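A rough llama-server sketch of that partial-offload approach (the repo name and the --n-cpu-moe value here are guesses; raise or lower the offload count until your VRAM is as full as possible):
./llama-server -hf ggml-org/gpt-oss-120b-GGUF -ngl 99 --n-cpu-moe 24 --ctx-size 32768 --flash-attn on --jinja
-ngl 99 keeps all layers nominally on the GPU, while --n-cpu-moe pushes the expert weights of the first N MoE layers back to system RAM, which is what makes a 120B MoE usable on a single 3090.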
Insomniac24x7@reddit
Thanks again, got 30 t/s with your guidance. Learning more.
overand@reddit
Glad to help! With a single 3090, 70B models are always going to be a balance between "slow" and "need to use a low-bit quant." But, there are some good options still! And Qwen3-Next-80B-A3B-Instruct should also be quite fast - the 3B at the end means "3B active parameters" rather than a full 70B active like in the Llama 3.x ones.
Insomniac24x7@reddit
Yes, I'm trying to dive deeper into that as we speak so I can understand it a bit better.
Insomniac24x7@reddit
Confirming, just ran GPT-OSS-120B at 40t/s, amazing. Thanks again
Insomniac24x7@reddit
Thank you very much for this.
semangeIof@reddit
Well, this model is at least 35GB in size, excluding all context, which means that only 68 percent of it (at most) fits into your VRAM. Why do you think it's slow?
4-bit quants are roughly 0.5GB per billion parameters. Pick something that'll fit into your VRAM while still leaving room for useful context.
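A rough back-of-the-envelope example with that rule of thumb (numbers are approximate and ignore quantization overhead):
Llama-3.3-70B at Q4: 70 x 0.5 ≈ 35 GB → only about 24/35 ≈ 68% fits in 24 GB of VRAM; the rest spills to system RAM
Qwen3-Coder-30B at Q4: 30 x 0.5 ≈ 15 GB → fits, leaving roughly 8-9 GB for KV cache and runtime overhead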
Insomniac24x7@reddit
Yes, absolutely. I'm a noob at this, so I'm still playing around and trying to understand.
megadonkeyx@reddit
I couldn't get the 30B A3B models like GLM and Qwen to do anything useful. Even the 80B Qwen Coder Next was poor.
Just using -fitc and letting it sort itself out; it's fast but totally bonkers. No quantized KV or anything.
Devstral Small 2 is the only model that actually produced some code.
Hector_Rvkp@reddit
You want to run an MoE with the active parameters and context strictly in VRAM, and the rest of the model in RAM. That works if your system RAM is DDR5; otherwise pretty much forget about it. It then becomes a question of how much RAM you have: 96 or 128GB will get you far enough, 64 not really. An LLM can help you pick, and check Hugging Face for the quantized sizes of a given model. Don't go above Q6; Q5 is great; at Q4 you're starting to leave precision on the table, but it can be worth it. Below that, unless the model is huge to begin with, it gets tricky.
naripok@reddit
You can run Qwen Coder Next (an 80B model) at Q4, with the full 260k context window, ~500 t/s prompt processing and 40 tok/s generation, on a single RTX 3090 with 64GB DDR4... It's not even difficult to do; a one-liner docker command to spin up a llama.cpp server does it all (a rough example is below).
The internet is rotten.
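A rough example of such a one-liner (the container tag, mount path, quant, and --n-cpu-moe value are placeholders, since the exact command isn't given here):
docker run --gpus all -v ~/models:/models -p 8080:8080 ghcr.io/ggml-org/llama.cpp:server-cuda -m /models/Qwen3-Coder-Next-UD-Q4_K_XL.gguf --host 0.0.0.0 --port 8080 -ngl 99 --n-cpu-moe 32 --ctx-size 262144 --flash-attn on --jinja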
Hector_Rvkp@reddit
Hmmm, can you though? DDR4 bandwidth is really slow. PCIe 3 or 4 is really slow. The 3090 is fast, but the active experts are constantly being swapped to generate tokens, and with that context size, most of the VRAM is already holding the cache.
naripok@reddit
No, hold on. I'm not saying that it runs at full context utilization at 40 tok/s. 40 tok/s is for 0-60k tokens of context. I see how my phrasing could be ambiguous there.
That said, yeah, it runs at that speed on average for my use as a software developer. This is very good for me, because it doesn't block me at my usual execution speed. If you're delegating more of the work to the AI and not reviewing the code as much as I do, or if you think much faster than me, you may get blocked... Sure... It all depends on the use case.
Anyway, I wrote my comment to point out that older-gen hardware is 100% up to the task for agentic coding, and that your comment makes it appear otherwise.
TheMotizzle@reddit
Qwen 3 coder next
FlexFreak@reddit
What quant do you recommend? I have been getting pretty bad prompt processing (pp) with Q3, CPU offloading, and llama.cpp.
social_tech_10@reddit
Qwen3-Coder-Next-MXFP4_MOE.gguf
TheMotizzle@reddit
I'm using this model on a 5090 and getting 70 tokens/sec. There's a chart floating around that shows the accuracy of the quants; on NVIDIA, MXFP4 does really well. Accuracy apparently holds up pretty well down to Q3, and from what others have said it's still usable all the way down to Q1. I've tried Q2 models that fit entirely in VRAM and got 140 tokens/sec. I asked for a logic test that would show accuracy differences between the quants; Q2 got the same result as MXFP4, so it held up. I had to tweak the startup options with ChatGPT a bit to get here. It's hardware and use case specific. I started out at 5 tokens/sec.
SithLordRising@reddit
Test speeds between Ollama and llama.cpp.
Pretty easy to calculate, but you need to know your CPU, RAM, and available capacity.
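For the llama.cpp side, llama-bench gives comparable numbers (the model path, thread count, and -ngl value here are placeholders for your own setup):
./llama-bench -m ~/models/Qwen3-Coder-30B-A3B-Q4_K_XL.gguf -ngl 99 -p 1024 -n 128 -t 16
It prints pp (prompt processing) and tg (token generation) t/s rows like the table earlier in this thread.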
Single_Ring4886@reddit
What are the speeds, and at which quants, when you need to offload to normal RAM?
cristoper@reddit
Qwen3-Coder-30B-A3B at a 4-bit quant is fast and great for code completion
gpt-oss-20b and gpt-oss-120b (offloaded to RAM) are both good all-around models
gemma3-27b (QAT 4-bit quant) is also still a good general purpose model and better at prose than the gpt models
Iaann@reddit
I'm asking the same, but I have 2x 3090 side by side and 64GB RAM.
Single_Ring4886@reddit
I read the whole thread, and the state of things is pretty much that only the Qwen models and GLM Flash are of some use in 2026, right? Which sadly aligns with my own experience.
midz99@reddit
I get about 40 tokens/second with Qwen 3 Coder 30B at Q4.
AlwaysLateToThaParty@reddit
what's your impression of the capability of that model and quant? Is it useful?
HeatedFlamie@reddit
For a single RTX 3090 (24GB VRAM) in 2026, the following models are recommended for local coding and reasoning
MoneyPowerNexis@reddit
Dry_Yam_4597@reddit
...I...see...what...you...did...there.
Technical-Earth-3254@reddit
Qwen 3 Coder REAP 25B in Q6L runs perfectly on mine. I also like the new Devstral Small 2. Ministral 14B reasoning is also quite strong and has vision. And Gemma 3 27B QAT performs reasonably well for everything that isn't programming.
d4mations@reddit
I’m using ministral3-14b reasoning and it’s quite capable for what I need it for
Technical-Earth-3254@reddit
It definitely is. The vision encoder is great, and it's also in the instruct version. I'm using the Q6 UD quant as a web browser assistant and it's doing very well and is very quick on a 3090.
d4mations@reddit
Have you tried the reasoning version?
Freaker79@reddit
I can run a lot of these models on my M1 Max 64GB, but when using them in opencode or nsnocoder they break on simple tool calling. I have no issues outside the harnesses though...
DuanLeksi_30@reddit
Devstral Small 2 24B 2512 Instruct with the Unsloth UD-Q4_K_XL GGUF is good. Remember to set the temperature to 0.15. I use a Q8 KV cache (llama.cpp).
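A minimal llama-server invocation along those lines (the exact Unsloth repo and filename here are assumptions; check Hugging Face for the real names):
./llama-server -hf unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:UD-Q4_K_XL -ngl 99 --ctx-size 32768 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --temp 0.15 --jinja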
jax_cooper@reddit
I'm planning to get a 3090 myself and to run the qwen3:30b 4-bit quant (about 19GB + context). There are instruct, coder, and thinking variants as well.
12bitmisfit@reddit
The byteshape releases are pretty good if you're trying to squeeze high t/s and context into 24GB of VRAM.
jax_cooper@reddit
wow, these look so promising, thank you!
tmvr@reddit
You can comfortably run both Qwen3 Coder 30B A3B and GLM 4.7 Flash in VRAM at Q4_K_XL; these will be very fast. You can also run the larger MoE models with good speed, like Qwen3 Coder Next 80B or gpt-oss 120B; the speed on these will depend on what type of system RAM you have. With DDR5-4800 you get at least 25 tok/s or more; with DDR4 it will be slower, of course.
CaterpillarPrevious2@reddit
I'm in the same space, and I'm waiting for the M5 launch to see if that would be good enough to fit Qwen 3, as I have similar requirements for coding and reasoning.
Admirable_Flower_287@reddit
Gemma 3 27B is still the best.
rainbyte@reddit
GLM-4.7-Flash and Qwen3-Coder-30B-A3B work fine with 24GB vram. I'm using both with IQ4_XS quant, they can do code generation and tool-calling.
There are other smaller models if you need SLMs for specific use cases. Take a look at LFM2.5, Ling-mini, Ernie, etc.
Weary_Long3409@reddit
This. 24GB of VRAM easily runs GLM-4.7-Flash at 131k ctx with Q4_K_XL and q8_0 KV cache. Even GPT-OSS-20B-mxfp4 can reach 524k ctx with q8_0 KV, so I get 4 parallel slots of 131k each.
lundrog@reddit
Use case?
Ryanmonroe82@reddit
RNJ-1-Instruct in BF16
12bitmisfit@reddit
Mostly larger MoE models only partially loaded in vram. Qwen coder next, gpt OSS 120b, etc.
Present-Ad-8531@reddit
Not a local option, but qwen-code gives 2k free calls per day. Since you're asking about coding, that would work, no?