Best Model for single 3090 in 2026?
Posted by myusuf3@reddit | LocalLLaMA | View on Reddit | 85 comments
Running a single RTX 3090 (24GB VRAM) and looking for the best overall model in 2026 for coding + reasoning.
Main priorities:
- Strong code generation (Go/TypeScript)
- Good reasoning depth
- Runs comfortably in 24GB (quantized is fine)
- Decent latency on local inference
What are you all running on a single 3090 right now? Qwen? DeepSeek? Something else? Would love specific model names + quant setups.
fasti-au@reddit
Qwen 35B at Q6, plus a 27B for the architect; swapping them in and out is worth it.
bolarius_art@reddit
I just set up an auto-switching interface in VS Code that uses DeepSeek locally and can dynamically call on CC for heavier workloads.
The goal is to run the local model for admin and research overnight, and use CC during the day for dev tasks.
Still very new and figuring it out, but the file-architecture stuff so far has been hella interesting.
durden111111@reddit
How much RAM do you have? If 96GB+, then just download the largest MoE that will fit in it and load it with llama.cpp.
OkBase5453@reddit
Hey, I just wanted to ask for pointers. I have an RTX 3090 + 2x Xeon E5-2696 v4 + 512GB DDR4-2400 RAM. I just don't know where to start; I only need it for scripting and reading large user manuals. Just got the GPU; previously, running CPU-only, I got this:
root@llama-cpp:~# numactl --interleave=all /opt/ik_llama.cpp/build/bin/llama-bench -m /mnt/nvme-llm/models/Qwen3.5-27B.Q4_K_M.gguf -t 65,78 -p 1024 --mmap 0
======================================= HAVE_FANCY_SIMD is NOT defined
| model | size | params | backend | threads | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ---: | ------------: | ---------------: |
| qwen35 27B Q4_K - Medium | 15.39 GiB | 26.90 B | CPU | 65 | 0 | pp1024 | 42.88 ± 0.86 |
| qwen35 27B Q4_K - Medium | 15.39 GiB | 26.90 B | CPU | 65 | 0 | tg128 | 4.80 ± 0.09 |
| qwen35 27B Q4_K - Medium | 15.39 GiB | 26.90 B | CPU | 78 | 0 | pp1024 | 48.74 ± 0.30 |
| qwen35 27B Q4_K - Medium | 15.39 GiB | 26.90 B | CPU | 78 | 0 | tg128 | 5.23 ± 0.07 |
build: 1a7aa3e7 (4323)
OmarasaurusRex@reddit
I just got the qwen3 coder next 80b working on my 3090 after someone recently posted that the ud-iq3 variant is super smart
It's really awesome
Qwen3-Coder-Next-UD-IQ3_XXS.gguf
/app/llama-server --port ${PORT} -hf unsloth/Qwen3-Coder-Next-GGUF:UD-IQ3_XXS --fit on --main-gpu 0 --flash-attn on --ctx-size 32768 --cache-type-k q4_1 --cache-type-v q4_1 -np 1 --jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --repeat-penalty 1.0 --metrics
This setup appears to use about 10GB of system RAM.
Approximate speeds on quick tests:
Performance test results:
| Metric | Value |
| --- | --- |
| Prompt tokens | 511 |
| Completion tokens | 1,470 |
| Total tokens | 1,981 |
| Prompt speed | 293.5 t/s |
| Generation speed | 29.5 t/s |
| Wall time | 51.6 s |
| Finish reason | stop (natural) |
Judge_OnReddit@reddit
I tried this too and pulled the same numbers until the context window grew. My CPU is an Intel 7900X (a space heater pretending to be a CPU) and the tokens/s tank for me; it took almost 20 minutes to process a 3,500-token prompt. Switched to Qwen3.5-35B-A3B-UD-Q3_K_XL - it fits in VRAM, can hold 3 parallel KV caches at 262k context size, and seems to be doing well at coding so far... fingers crossed.
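Something along these lines should reproduce the multi-slot setup; it's only a sketch, since the KV cache quant and whether everything actually fits in 24GB depend on the model's KV layout:
./llama-server -m Qwen3.5-35B-A3B-UD-Q3_K_XL.gguf -ngl 99 -np 3 --ctx-size 786432 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --jinja
Note that --ctx-size is the total context shared across the -np slots, so 786432 / 3 gives each of the 3 slots the 262144 tokens mentioned above.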
Insomniac24x7@reddit
Holy crap, I just tried this, this is incredible. Thank you!!
albertuki00@reddit
Which model and parameters (i.e. context window) would be good to use with a single RTX 3090, Ollama as the local LLM provider, and the Claude agent set to use local models? I've used GLM 4.7 Flash, but it sometimes gets stuck, times out, or gives weird results.
lmagusbr@reddit
there isn't one
myusuf3@reddit (OP)
What is everyone running here? $10K USD setups are a bit ridiculous for such a fast-moving space.
Prudent-Nebula-3239@reddit
$10K is a lot and even that doesn't get you far; realistically you'll need/want a $30-$70K setup, and it'll depreciate hard. Best to wait a few more years before you spend serious money on AI hardware.
AlwaysLateToThaParty@reddit
My experience recently is the opposite of this. Infrastructure that I've acquired over the last few years has increased in price.
Prudent-Nebula-3239@reddit
Because it’s scarce right now. That’s why prices are high.
There’s a record-scale AI data center arms race happening globally. Hyperscalers and governments are locking up GPUs, HBM, advanced packaging, power capacity, the whole supply chain. That’s billions pouring in all at once. Supply can’t instantly match that.
You can imagine what that kind of capital does over time. Tech improves fast, manufacturing scales, efficiency jumps. Same pattern we saw with lithium-ion batteries in the early Tesla days. Early hardware looked scarce and expensive, then scale and competition drove rapid improvement.
Same with used cars during COVID. New production stalled, used prices spiked. Once production resumed, prices cooled.
When fabs ramp, packaging expands, and the next generation makes today’s hardware less efficient per dollar and per watt, prices normalize. Hardware doesn’t escape depreciation just because we’re in a hype cycle.
AlwaysLateToThaParty@reddit
You're going to need a few more data points than 'trust me bro' when the evidence of reality directly contradicts your assertions.
Prudent-Nebula-3239@reddit
Please enlighten me
AlwaysLateToThaParty@reddit
You're the one making the assertions fella. I'm pointing at evidence that runs contrary to your opinion, so what are your assertions other than "just you wait"? That's the way critical thinking works.
Prudent-Nebula-3239@reddit
No you're just wasting my time
AlwaysLateToThaParty@reddit
Critical thinking is hard.
Prudent-Nebula-3239@reddit
For you it is, apparently. You think that after all the new datacenters go online in the next year or two, your 0.00001% of "infrastructure" won't go down in price like everything else on this earth? I gave you two excellent examples from the last 6 years, so what more do you want?
If you wanna argue your irrelevant opinion then at least come with something
AlwaysLateToThaParty@reddit
Imaginary thinking because you want to believe something, ignoring evidence because it doesn't suit your desires.
Prudent-Nebula-3239@reddit
Ok now I know you're just ragebaiting lol
Gtfoh
AlwaysLateToThaParty@reddit
Even when something is explained to you in detail, your conditioning makes you reject it. Consume less social media. It will materially benefit your life.
Prudent-Nebula-3239@reddit
You haven't explained anything though... nice try Diddy, get a life
AlwaysLateToThaParty@reddit
Words are hard.
Prudent-Nebula-3239@reddit
Damn you must be fun at parties lol
eightysixmonkeys@reddit
Imagine the AI bottleneck just becomes TSMC
blbd@reddit
Unified memory. Or Claude Code / Codex subscriptions.
fulgencio_batista@reddit
Is there a way to get unified memory without Apple?
braydon125@reddit
Nvidia jetson!!!
ZioRob2410@reddit
I have a chance to buy an orin agx for 2k usd more or less. Have you tried that?
braydon125@reddit
My cluster has two 64gb dev kits
ZioRob2410@reddit
How many tps ? And which models are you running on those ?
braydon125@reddit
Dm and I'll respond after work
Polymorphic-X@reddit
NVIDIA DGX Spark or AMD Ryzen AI Max+ 395 are non-Apple options for unified memory.
Ryanmonroe82@reddit
Have you used a Spark? It's very slow. Wouldn't advise it at the moment for LLMs.
fulgencio_batista@reddit
gawd damn i wish i was rich 🙏
blbd@reddit
DGX Spark and AMD Strix Halo aka Ryzen AI Max.
Pvt_Twinkietoes@reddit
Intel is working on them, but I'm not sure when we'll actually see that on the market.
CaterpillarPrevious2@reddit
No for subscriptions! Local is the king!
Insomniac24x7@reddit
I'm running a 3090 on an R9 7850X with llama.cpp and Llama-3.3-70B-Instruct-Q4_K_M.gguf, and unfortunately performance was abysmal: 3-4 tokens/s.
overand@reddit
Yeah, try that at like 2 bits - maybe the Unsloth one at UD-IQ2_XXS; it'll fit in your VRAM.
If you want to use a model that doesn't fit in your VRAM, you'll do best with a Mixture of Experts. Try GPT-OSS-20B and even 120B; you will probably be surprised by the performance of the 120B on a 3090! I was running that on a Ryzen 5 3600 system with DDR4 RAM and one 3090, and it was surprisingly decent.
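A rough llama-server sketch of that partial-offload approach (the repo name and the --n-cpu-moe value here are guesses; raise or lower the offload count until your VRAM is as full as possible):
./llama-server -hf ggml-org/gpt-oss-120b-GGUF -ngl 99 --n-cpu-moe 24 --ctx-size 32768 --flash-attn on --jinja
-ngl 99 keeps all layers nominally on the GPU, while --n-cpu-moe pushes the expert weights of the first N MoE layers back to system RAM, which is what makes a 120B MoE usable on a single 3090.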
Insomniac24x7@reddit
Thanks again, got 30 t/s with your guidance. Learning more.
overand@reddit
Glad to help! With a single 3090, 70B models are always going to be a balance between "slow" and "need to use a low-bit quant." But, there are some good options still! And Qwen3-Next-80B-A3B-Instruct should also be quite fast - the 3B at the end means "3B active parameters" rather than a full 70B active like in the Llama 3.x ones.
Insomniac24x7@reddit
Yes, I'm trying to dive deeper into that as we speak so I can understand it a bit better.
Insomniac24x7@reddit
Confirming, just ran GPT-OSS-120B at 40t/s, amazing. Thanks again
Insomniac24x7@reddit
Thank you very much for this.
semangeIof@reddit
Well, this model is at least 35GB in size, excluding all context, which means that only 68 percent of it (at most) fits into your VRAM. Why do you think it's slow?
4-bit quants are roughly 0.5GB per billion parameters. Pick something that'll fit into your VRAM while still leaving room for useful context.
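A rough back-of-the-envelope example with that rule of thumb (numbers are approximate and ignore quantization overhead):
Llama-3.3-70B at Q4: 70 x 0.5 ≈ 35 GB → only about 24/35 ≈ 68% fits in 24 GB of VRAM; the rest spills to system RAM
Qwen3-Coder-30B at Q4: 30 x 0.5 ≈ 15 GB → fits, leaving roughly 8-9 GB for KV cache and runtime overhead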
Insomniac24x7@reddit
Yes, absolutely. I'm a noob at this, so I'm still playing around and trying to understand.
megadonkeyx@reddit
I couldn't get the 30B A3B models like GLM and Qwen to do anything useful. Even the 80B Qwen Coder Next was poor.
Just using -fitc and letting it sort itself out; it's fast but totally bonkers. No quantized KV or anything.
Devstral Small 2 is the only model that actually produced some code.
Hector_Rvkp@reddit
You want to run an MoE with the active parameters and context strictly in VRAM, and the rest of the model in RAM. That works if your system RAM is DDR5; otherwise pretty much forget about it. It then becomes a question of how much RAM you have: 96 or 128GB will get you far enough, 64 not really. An LLM can help you pick, and check Hugging Face for the quantized sizes of a given model. Don't go above Q6; Q5 is great; at Q4 you're starting to leave precision on the table, but it can be worth it. Below that, unless the model is huge to begin with, it gets tricky.
naripok@reddit
You can run Qwen Coder Next (an 80B model) at Q4, with the full 260k context window, ~500 t/s prompt processing and 40 tok/s generation, on a single RTX 3090 with 64GB DDR4... It's not even difficult to do; a one-liner docker command to spin up a llama.cpp server does it all (a rough example is below).
The internet is rotten.
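A rough example of such a one-liner (the container tag, mount path, quant, and --n-cpu-moe value are placeholders, since the exact command isn't given here):
docker run --gpus all -v ~/models:/models -p 8080:8080 ghcr.io/ggml-org/llama.cpp:server-cuda -m /models/Qwen3-Coder-Next-UD-Q4_K_XL.gguf --host 0.0.0.0 --port 8080 -ngl 99 --n-cpu-moe 32 --ctx-size 262144 --flash-attn on --jinja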
Hector_Rvkp@reddit
Hmmm, can you though? DDR4 bandwidth is really slow. PCIe 3 or 4 is really slow. The 3090 is fast, but the active experts are constantly being swapped to generate tokens, and with that context size, most of the VRAM is already holding the cache.
naripok@reddit
No, hold on. I'm not saying that it runs at full context utilization at 40 tok/s. 40 tok/s is for 0-60k tokens of context. I see how my phrasing could be ambiguous there.
That said, yeah, it runs at that speed on average for my use as a software developer. This is very good for me, because it doesn't block me at my usual execution speed. If you're delegating more of the work to the AI and not reviewing the code as much as I do, or if you think much faster than me, you may get blocked... Sure... It all depends on the use case.
Anyway, I wrote my comment to point out that older-gen hardware is 100% up to the task for agentic coding, and that your comment makes it appear otherwise.
TheMotizzle@reddit
Qwen 3 coder next
FlexFreak@reddit
What quant do you recommend? I have been getting pretty bad prompt processing (pp) with Q3, CPU offloading, and llama.cpp.
social_tech_10@reddit
Qwen3-Coder-Next-MXFP4_MOE.gguf
TheMotizzle@reddit
I'm using this model on a 5090 and getting 70 tokens/sec. There's a chart floating around that shows the accuracy of the quants; on NVIDIA, MXFP4 does really well. Accuracy apparently holds up pretty well down to Q3, and from what others have said it's still usable all the way down to Q1. I've tried Q2 models that fit entirely in VRAM and got 140 tokens/sec. I asked for a logic test that would show accuracy differences between the quants; Q2 got the same result as MXFP4, so it held up. I had to tweak the startup options with ChatGPT a bit to get here. It's hardware and use case specific. I started out at 5 tokens/sec.
SithLordRising@reddit
Test speeds between Ollama and llama.cpp.
Pretty easy to calculate, but you need to know your CPU, RAM, and available capacity.
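For the llama.cpp side, llama-bench gives comparable numbers (the model path, thread count, and -ngl value here are placeholders for your own setup):
./llama-bench -m ~/models/Qwen3-Coder-30B-A3B-Q4_K_XL.gguf -ngl 99 -p 1024 -n 128 -t 16
It prints pp (prompt processing) and tg (token generation) t/s rows like the table earlier in this thread.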
Single_Ring4886@reddit
What are the speeds, and at which quants, when you need to offload to normal RAM?
cristoper@reddit
Qwen3-Coder-30B-A3B at a 4-bit quant is fast and great for code completion
gpt-oss-20b and gpt-oss-120b (offloaded to RAM) are both good all-around models
gemma3-27b (QAT 4-bit quant) is also still a good general purpose model and better at prose than the gpt models
Iaann@reddit
I'm asking the same, but I have 2x 3090 side by side and 64GB RAM.
Single_Ring4886@reddit
I read the whole thread, and the state of things is pretty much that only the Qwen models and GLM Flash are of some use in 2026, right? Which sadly aligns with my own experience.
midz99@reddit
I get about 40 tokens/second with Qwen 3 Coder 30B at Q4.
AlwaysLateToThaParty@reddit
what's your impression of the capability of that model and quant? Is it useful?
HeatedFlamie@reddit
For a single RTX 3090 (24GB VRAM) in 2026, the following models are recommended for local coding and reasoning
MoneyPowerNexis@reddit
Dry_Yam_4597@reddit
...I...see...what...you...did...there.
Technical-Earth-3254@reddit
Qwen 3 Coder REAP 25B in Q6L runs perfectly on mine. I also like the new Devstral Small 2. Ministral 14B reasoning is also quite strong and has vision. And Gemma 3 27B QAT performs reasonably well for everything that isn't programming.
d4mations@reddit
I’m using ministral3-14b reasoning and it’s quite capable for what I need it for
Technical-Earth-3254@reddit
It definitely is. The vision encoder is great, and it's also in the instruct version. I'm using the Q6 UD quant as a web browser assistant and it's doing very well and is very quick on a 3090.
d4mations@reddit
Have you tried the reasoning version?
Freaker79@reddit
I can run a lot of these models on my M1 Max 64GB, but when using them in opencode or nsnocoder they break on simple tool calling. I have no issues outside the harnesses though...
DuanLeksi_30@reddit
Devstral Small 2 24B 2512 Instruct with the Unsloth UD-Q4_K_XL GGUF is good. Remember to set the temperature to 0.15. I use a Q8 KV cache (llama.cpp).
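A minimal llama-server invocation along those lines (the exact Unsloth repo and filename here are assumptions; check Hugging Face for the real names):
./llama-server -hf unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:UD-Q4_K_XL -ngl 99 --ctx-size 32768 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --temp 0.15 --jinja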
jax_cooper@reddit
I'm planning to get a 3090 myself and to run the qwen3:30b 4-bit quant (about 19GB + context). There are instruct, coder, and thinking variants as well.
12bitmisfit@reddit
The byteshape releases are pretty good if you're trying to squeeze high t/s and context into 24GB of VRAM.
jax_cooper@reddit
wow, these look so promising, thank you!
tmvr@reddit
You can comfortably run both Qwen3 Coder 30B A3B and GLM 4.7 Flash in VRAM at Q4_K_XL; these will be very fast. You can also run the larger MoE models with good speed, like Qwen3 Coder Next 80B or gpt-oss 120B; the speed on these will depend on what type of system RAM you have. With DDR5-4800 you get at least 25 tok/s or more; with DDR4 it will be slower, of course.
CaterpillarPrevious2@reddit
I'm in the same space, and I'm waiting for the M5 launch to see if that would be good enough to fit Qwen 3, as I have similar requirements for coding and reasoning.
Admirable_Flower_287@reddit
Gemma 3 27B is still the best.
rainbyte@reddit
GLM-4.7-Flash and Qwen3-Coder-30B-A3B work fine with 24GB vram. I'm using both with IQ4_XS quant, they can do code generation and tool-calling.
There are other smaller models if you need SLMs for specific use cases. Take a look at LFM2.5, Ling-mini, Ernie, etc.
Weary_Long3409@reddit
This. 24GB of VRAM easily runs GLM-4.7-Flash at 131k ctx with Q4_K_XL and q8_0 KV cache. Even GPT-OSS-20B-mxfp4 can reach 524k ctx with q8_0 KV, so I get 4 parallel slots of 131k each.
lundrog@reddit
Use case?
Ryanmonroe82@reddit
RNJ-1-Instruct in BF16
12bitmisfit@reddit
Mostly larger MoE models only partially loaded in vram. Qwen coder next, gpt OSS 120b, etc.
Present-Ad-8531@reddit
Not a local option, but qwen-code gives 2k free calls per day. Since you're asking about coding, that would work, no?