What is the most capable model you can actually run on a single consumer GPU?
Posted by Longjumping-Bar-885@reddit | LocalLLaMA | 47 comments
Not "what benchmarks the best" or "what has the most parameters." I mean in your actual daily use.
If you had to pick one model to run locally on something like a 4090 or 3090 and use for real work, what is your go-to?
I am curious about the gap between benchmark leaders and what is actually usable at decent context lengths without quantization artifacts making the output garbage.
What is your sweet spot for capability vs. hardware reality?
YoungSuccessful1052@reddit
For me it's either Qwen3.5 35B or Gemma 4 26B. The dense 27B and 31B are just way too slow for my liking, and these two MoEs are "good enough" at much higher speeds. But YMMV.
ObsidianIdol@reddit
If I have a 5090, will I be able to run the 27B versions?
YoungSuccessful1052@reddit
Sorry for the late reply. Yes, a 5090 runs 27B pretty nicely. I have a 3090 and that runs it at around 40 tokens per second for generation, so I assume the 5090 would do at least 50, which in my opinion is completely usable.
dobkeratops@reddit
Gemma 4 makes sense, but on a 4090 (24GB) I think you can run 27Bs fast enough, albeit with more quantisation to free up enough context window space? I think I was seeing (Gemma 3) 27B Q4 dense handle 40k context on a 4090. I want to try Qwen 3.6's model at this size given all the enthusiasm. Seems like the 27B dense plays to this device's strengths better than 35B-A4?
Lost-Health-8675@reddit
Jump on the 3.6 wagon, you will enjoy the ride.
YoungSuccessful1052@reddit
Yeah, definitely. I'm a bit out of the loop so I haven't gotten around to testing 3.6 yet.
cviperr33@reddit
Qwen 3.6 35B MoE fits its entire 260k context in a single 24GB VRAM GPU, and on a 3090 it's really fast, 130-140 tk/s. Next is Qwen 3.6 27B, which also fits but at 100k context and 30-40 tk/s.
For daily use as my hermes agent / search and coding I would use the 35B MoE 90% of the time; it is way better than anything else. For a very hard, specific coding job that the MoE failed at, I would then switch to the dense.
Both of these came out about a week ago, and nothing comes close to their performance/intelligence. Maybe Gemma 4, but why would you use that when you have Qwen? Maybe Gemma outshines Qwen for specific tasks, but I haven't found such a case yet.
Chupa-Skrull@reddit
Gemma beats Qwen in anything that has to do with writing English text that doesn't annoy the shit out of you. So for any kind of summarizer or non-code knowledge work agent it still wins (for me)
Middle_Bullfrog_6173@reddit
It beats Qwen even more when writing most non-English languages, Chinese likely being an exception (I cannot evaluate that).
But its code and math are weaker.
a9udn9u@reddit
It writes better Chinese than Qwen3.5, but not better than 3.6.
Chupa-Skrull@reddit
That's interesting, I would've expected things to even out there instead. I'll have to keep that in mind. I also have no frame of reference on the Chinese; I'd be interested to hear from anyone who does.
jld1532@reddit
Not really when it comes to rigorous scientific writing, though. Perhaps for creative writing, sure, but my use case for Gemma right now is a final check/backup to Qwen.
Chupa-Skrull@reddit
Yes, really, universally in my experience. Happy that whatever you've got working for you works for you, though.
Longjumping-Bar-885@reddit (OP)
This is exactly the split I'm doing too: MoE for daily stuff, dense for the nasty edge cases. I've been testing something that makes that back-and-forth way less painful; I'll DM you.
Sad-Savings-6004@reddit
What's your setup to get the whole 35B and full context into 24GB VRAM? I'm at 160k context with Q4_K_M and it's 31GB even with Q8 KV cache.
cviperr33@reddit
I use Unsloth's IQ4_NL for the 35B, and at Q8 KV cache it just fits at max or near-max context, depending on the system overhead you have; on Linux, no problem.
For the dense model there was no UD IQ4 at the time, so I had to use the Q4_K_M, which is a similar size, but I couldn't load more than 100k context at Q8 KV. I could probably do it at Q4 KV, but I don't like the speed and quality degradation.
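The launch line itself is nothing special, basically llama-server with a quantized KV cache. Roughly like this (the filename and paths are placeholders, and the flash-attention flag syntax depends on your build):

```
# Rough sketch; quant filename and paths are placeholders, adjust for your setup.
# Q8 V-cache needs flash attention; older llama.cpp builds take a bare -fa flag.
llama-server \
  -m ~/models/Qwen3.6-35B-A3B-IQ4_NL.gguf \
  -c 262144 \
  -ngl 99 \
  --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --port 8080
```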
Sisuuu@reddit
Interesting! Are you using llama-swap or running both fully on two different GPUs at the same time?
cviperr33@reddit
No, for now I'm just using two different .sh scripts to start the models with llama.cpp. The dense model came out a day ago and I haven't set up an automatic llama-swap route yet.
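When I do set up llama-swap it should only need a small config, roughly like this (keys from memory, so check the llama-swap README for the exact schema; paths are placeholders):

```
# Rough llama-swap config sketch; model paths are placeholders.
models:
  "qwen3.6-35b-moe":
    cmd: >
      llama-server --port ${PORT}
      -m /models/Qwen3.6-35B-A3B-IQ4_NL.gguf
      -c 262144 -ngl 99
  "qwen3.6-27b-dense":
    cmd: >
      llama-server --port ${PORT}
      -m /models/Qwen3.6-27B-Q4_K_M.gguf
      -c 102400 -ngl 99
    ttl: 300
```

llama-swap then loads whichever model the incoming request names and swaps the other one out.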
BlobbyMcBlobber@reddit
Maybe using brutal quantization.
DependentBat5432@reddit
Switching between the MoE for daily tasks and routing the hard stuff through a gateway is exactly what I ended up doing. Went from paying $200/month to barely spending anything.
Prize_Negotiation66@reddit
qwen3.6-27b and gemma-4-31b
UnhingedBench@reddit
I've asked myself the same question for my MacBook (which I consider consumer hardware with a 128GB GPU).
Terminator857@reddit
Get a Strix Halo and run Qwen 3.5 122B Q4.
Look_0ver_There@reddit
Try to push it to Q5_M if you have a 128GB model. There's a marked improvement in intelligence with that jump.
segmond@reddit
I run all of them. KimiK2.6, GLM5.1, DeepSeekV3.2, Qwen3-397B.
Patiently, of course. The key is RAM. I was fortunate to get 512GB of system RAM before the crazy price jump. My only regret is that I bought slow 2400MHz RAM. But if you can get fast RAM on at least an 8-channel system and pair it up with a few GPUs, you can manage.
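The usual approach is to keep attention and the shared layers on the GPUs and push the MoE expert tensors into system RAM. With llama.cpp or ik_llama.cpp that looks roughly like this (the filename is a placeholder and the tensor-override pattern varies by model, so treat it as a sketch):

```
# Sketch: dense/attention tensors on GPU, MoE expert tensors ("exps") in system RAM.
llama-server \
  -m /models/DeepSeek-V3.2-Q5.gguf \
  -c 32768 \
  -ngl 99 \
  -ot "exps=CPU" \
  --threads 32 \
  --port 8080
```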
Jeidoz@reddit
Wow, I suppose you are running Q4 quants? Q8 or the original models may take up to 0.8-2TB of memory. What is your average token generation speed? How long does it take to produce a response with thinking enabled and tools available? I am curious whether it's even worth trying them in RAM when "medium" 27B+ models exist and can fit into "medium+" gaming GPUs (8-12GB+ VRAM).
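My rough sizing math for the weights alone (ignoring KV cache and runtime overhead), assuming a model in the ~1T-parameter class:

```
# Back-of-the-envelope: weight size ≈ parameter count × bits per weight / 8
#   Q4  (~4.5 bpw): 1e12 × 4.5 / 8 ≈ 0.56 TB
#   Q8  (~8.5 bpw): 1e12 × 8.5 / 8 ≈ 1.06 TB
#   BF16 (16 bpw):  1e12 × 16  / 8 = 2 TB
```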
segmond@reddit
Q4 for Kimi, Q5 for GLM5.1 & DeepSeek, Q8 for Qwen-397B. Tokens per second? Slooooow. I use them to plan, not for agentic work. Smaller models for agentic work, big models for deep thinking.
Lissanro@reddit
I also have 8-channel RAM (1TB of 3200MHz in my case) - combined with VRAM, it is quite usable even with larger models. I noticed that the 397B can have pretty decent performance even with RAM offloading (here I shared my performance for various models, tested with both llama.cpp and ik_llama.cpp); it is pretty good when I need to do something that is not too complicated but still requires vision.
That said, the recently released Qwen 3.6 27B is also worth mentioning. It does not beat Qwen 3.5 397B in my tests, but it comes quite close in many areas, so for those who do not have much RAM but have sufficient VRAM, it can be a good model. I also use it sometimes with vLLM when I need high throughput or to process videos.
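The vLLM side is just the standard serve command, roughly like this (the repo name is a placeholder, use whatever the official release is called):

```
# Sketch: serving the 27B with vLLM for throughput; repo name is a placeholder.
vllm serve Qwen/Qwen3.6-27B-Instruct \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90
```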
No_Lingonberry1201@reddit
I have a 4060 with 8GB VRAM and the Qwen 3.6 35B A3B Q5_K_XL quant runs at ~20t/s for me.
Academic-Map268@reddit
It's running on your CPU. Your GPU is barely involved lol.
No_Lingonberry1201@reddit
It's running on the GPU; the MoE only loads the active params. CPU inference is 5t/s on good days.
Academic-Map268@reddit
4B model when it fits in my vram: 34 t/s
MoE with 3.3B active when it doesn't fit in my vram: 8 t/s
No_Lingonberry1201@reddit
What configuration are you using? I can see that most of my VRAM is being used by llama.cpp and, as I said, I get 20t/s for the MoE and 5t/s at most for the 27B dense model.
Hodr@reddit
How much context are you running?
No_Lingonberry1201@reddit
Not much, around 64k.
arstarsta@reddit
First, why not a 5090 with 32GB of VRAM in your consumer definition?
The NVIDIA RTX 6000 is the borderline case; RTX should mean consumer, like GeForce did before.
DrDisintegrator@reddit
I like Gemma 4. I tried Qwen, but it made up too much stuff.
Academic-Map268@reddit
Yeah, it does that. Even 3.6 plus web search still makes up stuff.
Charming-Author4877@reddit
I tested that a few days ago; I invested 5-6 hours in total testing the Gemma and Qwen 3.6 models and comparing them.
The intro is here:
https://www.reddit.com/r/GithubCopilot/comments/1ss583x/i_am_not_switching_yet_but_i_tested_gemma4_and/
You'll find a link to the 3.6 comparison there too.
There is nothing better on a consumer card atm.
DeltaSqueezer@reddit
I'm running unquantized Qwen3.5-9B with about 140k of context on a 3090.
I found this to be a good balance of speed, context and intelligence.
Eyelbee@reddit
Q4 27B models would fit inside your GPU and be vastly more intelligent. If that's too slow, try the 35B MoE at IQ4_XS. It'll be 5 times better than a 9B and extremely fast with a lot of context as well.
Gesha24@reddit
Check out the 35B model (Qwen 3.6 since it's out). It can tolerate offloading some layers to RAM and keep decent performance.
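With llama.cpp the easiest way to do that for a MoE is to push the expert weights into system RAM and keep everything else on the GPU. A rough sketch (newer builds have --n-cpu-moe, older ones use the -ot "exps=CPU" override instead; the filename is a placeholder):

```
# Sketch: experts for the first 20 layers go to system RAM, the rest stays on the GPU.
llama-server \
  -m /models/Qwen3.6-35B-A3B-IQ4_XS.gguf \
  -c 65536 \
  -ngl 99 \
  --n-cpu-moe 20 \
  --port 8080
```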
Born-Caterpillar-814@reddit
Q3CN @q8
jacek2023@reddit
I've been using Gemma 26B for about a week now for agentic coding (OpenCode initially, now trying pi).
Technical-Earth-3254@reddit
The 27B Qwen 3.6 is very capable, even at quite low quants like IQ4_XS (I'm using that on my 3090). It's decently fast (25-30 tps), and with Q8 KV cache you can fit around 80k context. It all depends on what you do and what you need, tbh. There are also other great models that fit on the card and are faster, but not as capable imo.
erwan@reddit
On Hugging Face you can create an account and tell it your hardware. It will then show you which models (and which versions) can run on it.
Intrepid_Dare6377@reddit
Qwen3.6 and Gemma 4 are also my go-tos. Gemma flies. Qwen is a better rule follower.