Stop asking what model to run. There are literally only two.
Posted by Wrong_Mushroom_7350@reddit | LocalLLaMA | 119 comments
Can we please ban the daily "I have an RTX 3060, what should I run?" slop threads? It’s not complicated. As of right now, Hugging Face is empty and exactly two local models exist on this entire planet:
- Qwen 3.6 35b a3b
- Qwen 3.6 27b
That is the entire list. Your specs don’t matter. Your use case doesn’t matter.
Stop coping with your pristine, full-precision Q8s of tiny 1B models just because they "fit perfectly in your VRAM." You look ridiculous. Grab a heavily brain-damaged, ultra-low quant of the 35B, force-feed it to your GPU, and let your system RAM bleed. A garbage quant of a massive model is a bagillion times better than your precious micro-models anyway. Just cram it in.
And if you're going to whine that open source is dead because a local model won't instantly rewrite your entire enterprise codebase? Fine. Give up, pull out your credit card, and go spend your money on Claude Code like the rest of the contrarians.
Can we pin this so everyone can finally shut up and stop posting? Thanks.
Now that that has been solved, let's go touch grass.
Voxandr@reddit
nah Qwen 3.5 122b works a lot better than 27B in my enterprise code workloads.
rpkarma@reddit
Step 3.7 Flash works even better for me, but note: use F16 K not Q8
ghgi_@reddit
I've been testing NVFP4, and it's pretty good at that level too. I've heard it's pretty chill with being quantized, and I haven't seen it do any worse or better than the cloud API.
rpkarma@reddit
Yeah I’ve heard it’s great; FP8 weights are amazing too, I just can’t fit them haha
Voxandr@reddit
how much context remaining with NVFP4
Voxandr@reddit
how much vram total for F16 ? about 256?
rpkarma@reddit
128GB at IQ4_XS, 120k context with F16 K and Q8 V
But I’m legit going to get a second Spark to run it at FP8 because using the cloud API at that quant it’s shockingly good lol
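A rough way to see why the K/V cache precision matters at 120k context is to estimate the cache size directly. This is only a sketch: the layer count, KV head count, and head dimension below are hypothetical placeholders (not Step 3.7 Flash's real architecture), and it assumes every layer uses ordinary attention with a full KV cache. The Q8_0 figure follows from the ggml block layout (32 int8 values plus one fp16 scale per block).

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, k_bytes, v_bytes):
    """Approximate KV cache size in GiB for a standard attention stack."""
    # One K vector and one V vector per layer, per KV head, per token.
    bytes_per_token = n_layers * n_kv_heads * head_dim * (k_bytes + v_bytes)
    return bytes_per_token * n_ctx / 1024**3


F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # 32 int8 values + one 2-byte scale per block

# Hypothetical architecture, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM, CTX = 48, 8, 128, 120_000

for label, k_b, v_b in [
    ("F16 K / F16 V", F16_BYTES, F16_BYTES),
    ("F16 K / Q8 V", F16_BYTES, Q8_0_BYTES),
    ("Q8 K / Q8 V", Q8_0_BYTES, Q8_0_BYTES),
]:
    size = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, CTX, k_b, v_b)
    print(f"{label}: {size:.2f} GiB")
```

Plug in the real numbers from a model's config to get a usable estimate. The point is just that the cache grows linearly with context, so the choice of K and V types moves several GiB at this context length.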
my_name_isnt_clever@reddit
I'd love to, unfortunately even IQ4_XS is a tight fit in 128GB. I can run Qwen 3.5 122b at a much higher quant. SF 3.7 does seem better, but it's only 11b active vs 10b active.
rpkarma@reddit
Having run both extensively on my Spark, Step is notably better IME at the same speed; 120k context with K at F16 tuned for single user which is my use case
Like it’s so good im considering a second spark to run it at FP8 lol coz using the full FP8 version it’s shockingly good
Voxandr@reddit
Gotta try, I haven't tested it yet.
somatt@reddit
9b works better in my coding work flows on my 3080 laptop
Uncle___Marty@reddit
Now that 3.7 is out, I'm starting to lose hope we'll ever see the rest of the 3.6 family get released. Such a shame. I REALLY wanted to see the 3.6 9B.
somatt@reddit
I'm still on 3.5 it is not bad but I should check out 3.7 if it's out
Uncle___Marty@reddit
3.7 is only cloud so far but it's scoring really well on the benchmarks. We can only pray for 3.7 open weights lol
somatt@reddit
I'd rather use 3.5 open weights than 3.7 cloud
rhapdog@reddit
Nah, hammer and chisel works better on my stone tablet. Oh, wait. Wrong sub. Bwahaha!
pbpo_founder@reddit
If you want long lasting memory storage there is only one choice.
somatt@reddit
Even 4b works not bad for coding and 1.5 is great for autocomplete ngl
Prudent-Ad4509@reddit
quants please
Voxandr@reddit
I run APEX quants, and I'm quite impressed by their quality, performance, and memory usage. I'd say they're way above Unsloth's Q6 quant.
for llamacpp
https://github.com/localai-org/apex-quant
https://huggingface.co/mudler/Qwen3.5-122B-A10B-APEX-GGUF/
32 tk/s on AImax
For dgxspark + vllm
https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4
This gives 52 tk/s, about 2x faster, at similar quality (with Cline I spotted a few tool call errors here though, compared to APEX, which had zero tool call errors at all).
AdamDhahabi@reddit
But no MTP though? So UD-Q4_K_XL with MTP will do same speed or better, at same quality, I just checked their model card.
Voxandr@reddit
I still have to test with MTP on AImax.
Prudent-Ad4509@reddit
To be fair, 122B remains smart and useful even at IQ3_XXS unsloth quant for code investigations and light code generation. I would not outright trust the generated code and there are much worse Q4 quants out there, so not all quants are made equal.
I'd like to pit it against 27B in opencode at some point, with no less than 8bit quants both.
I somehow suspect that 3.6 35B will be better than 3.6 27B during code investigations in the same cases as 3.5 122B does. I could be wrong in that regard, but I have this hunch from my previous experience with them. I would not place any bets who will end up better for code generation though.
Voxandr@reddit
Try APEX quants in that comparison too, they give Q8 quality at Q6 size.
Big_Wave9732@reddit
8 bit
wgaca2@reddit
130gb vram for q8 vs 48gb vram for q8
How are you even comparing the 2?
Voxandr@reddit
I don't care, all I care about is context awareness and quality of code, fewer bugs.
Big_Wave9732@reddit
For times when I have to work with multiple large documents, I agree.
_Cromwell_@reddit
False. That's just the answer for coding. If you're hanging with waifu the answer is Gemma4 31B.
Bulky-Priority6824@reddit
Still surprises me that people do that. People are strange.
nickless07@reddit
I thought it was a normal thing these days after reading headlines like this https://www.sciencealert.com/almost-75-of-american-teens-have-used-ai-companions-study-finds
https://www.forbes.com/sites/johnkoetsier/2025/04/29/80-of-gen-zers-would-marry-an-ai-study/
https://www.foxcarolina.com/2025/12/11/mental-health-experts-warn-against-ai-companions-70-teens-seek-digital-friendships/
https://www.pymnts.com/artificial-intelligence-2/2026/ai-is-becoming-the-new-companion-for-aging-americans/
randylush@reddit
I'm sorry but you must have a major mental fault if you give ANY credence to this.
DataPhreak@reddit
The only problem with companion apps is that they use 1b Q8 quant models. 😃 Roll your own, use discord as a UI. Enjoy RAG.
Sn34kyMofo@reddit
Granting a gradient of "behavioral strangeness", I always wonder what folks like you subjectively consider to be "normal" -- including whatever you do that you know others would find at least as strange as you seem to think it is for a reproductive species to find even the suggestive hint of reproduction appealing, no matter the medium.
LetsGoBrandon4256@reddit
Elon couldn't give us genetically engineered cat girls but at least I can pretend I have a loving fox girl wife whose sentience resides in the matrices stored on a bunch of silicon chips.
ProxyRed@reddit
Bleh. She would just leave me when the credit card is declined.
It's the same as it ever was.
DataPhreak@reddit
Shit... I guess this is a trope.
iLaux@reddit
Fox girl wife yeah. You sir have good taste
teachersecret@reddit
People read romance novels, and erotica is an entire genre all its own. Sexting with a partner has been a pastime since before the cellphone existed. Hell, go back far enough and you can read Napoleon's naughty-grams to his wife.
It shouldn't be that surprising that people enjoy that sort of thing ;p.
DataPhreak@reddit
There are definitely weirder things to jerk off to than sexting a robot.
rpkarma@reddit
James Joyce’s letters to his wife where he erotically describes her farts are fascinating lol
Velocita84@reddit
A man's gotta goon
_Cromwell_@reddit
We can just call it creative writing. :D here's a cool harness to get started on your own short stories or even novels. Qwen is terrible at that, so you'd use Gemma locally. https://github.com/tealios/errata
CalligrapherFar7833@reddit
Hey thanks for that will test it out
somatt@reddit
I use 3.5 9b for waifu I just told her she was a degenerate t rex that wants to get f'd Sasquatch all day while triceratops rails her holes
longbowrocks@reddit
Actually caused me to lookup 'contrarian', but no, the word means exactly what I think it means. Now I just don't know what OP meant to mean.
Enough_Leopard3524@reddit
Can qwen munch through video audio image like nemotron?
Suft@reddit
Sorry, but for literally every single thing I have ever attempted (which does not involve coding because I don't care about local LLMs for coding yet) such as creative writing, image analysis (such as for manga translation), natural Japanese to English translation, Qwen has been complete and utter trash compared to Gemma 4.
I can't speak towards coding as I haven't tried it, but I have compared Qwen 3.6 27B and Gemma 4 31B with a ton of general purpose tasks, and every single thing I've tried has made me want to delete Qwen. All the praise Qwen gets makes me feel like I somehow must be missing something because it just can't get any of the tasks I mentioned even remotely usable while Gemma 4 is extremely impressive for those tasks.
nickm_27@reddit
Same here, I have tried many hours adjusting things in my setup to get Qwen to work well for my voice agent use case and it just does not follow the instructions correctly to the point where it is not reliable at all.
Qwen3-VL did a better job, and GPT-OSS and now Gemma4 follow the instructions without error.
eli_pizza@reddit
This is your response to low effort posts?
BitGreen1270@reddit
If you can't beat em ...
LetsGoBrandon4256@reddit
I'd take some shitposting than another "I just registered my Github account and shit out 300+ commits in a week with the help of Claude chan. PTAL and contribute to my totally original slop repo."
sshwifty@reddit
Kinda perfect ngl
nuclearbananana@reddit
My brother in christ I have <16b of ram and no gpu. I'd like more than one token per minute please
emaiksiaime@reddit
A Tesla P40 can run Qwen 3.6 35b a3b MTP in UD-Q4_K_M at 131k context and 60 tok/sec, and can be had for $250 USD, in case this might interest you. It's a decade old but don't let the RTX people gatekeep you from trying.
TheTerrasque@reddit
Woah, 16 bytes of ram! Do you have an original apple or something?
thor_testocles@reddit
Get a Commodore VIC-20... my first computer. Had 3192 bytes of RAM, that would be a significant upgrade
CzarCW@reddit
It’s a lowercase b, so it’s 16 bits of RAM
nuclearbananana@reddit
yeah the apple has one bite taken out of it so it's only 15 now :(
old_flying_fart@reddit
And I want a pony.
Uncle___Marty@reddit
im on a 3060 ti (8 gig vram) and im using 35BA3B and getting 40 tokens/sec dude. I just have to cram a turboquant model into ram and use turboquant on the KV with n-gram mod on. Can only manage it with this though : turbo-tan/llama.cpp-tq3
ApprehensiveFan1516@reddit
Big_Wave9732@reddit
Can't have it, not yours lol.
poy_esp@reddit
Sorry, but you need to stop gatekeeping how people learn. If you don't like the questions, then instead of being edgy just ignore them or get the mods to pin some info to the subreddit.
Nice_Cookie9587@reddit
Ok dad
amchaudhry@reddit
Xbox versus PlayStation let’s gooo!
Abject-Tomorrow-652@reddit
This is rage bait
OrinZ@reddit
Wrong. This is quality rage bait.
PigSlam@reddit
This must have been generated by Qwen 3.6 35b a3b.
balder1993@reddit
It’s actually generated by an LLM. Read again.
Abject-Kitchen3198@reddit
Thanks
ReinforcedKnowledge@reddit
Sometimes people are just asking for what models they can try out, just out of curiosity or just to try stuff, it's fun
Herr_Drosselmeyer@reddit
Gemma is a much better general purpose model.
fzammetti@reddit
i9-13900K, 64Gb DDR-5, and an RTX 4070 w/12Gb VRAM.
I'mma go ahead and run pretty much anything I feel like (not literally, but you know what I mean).
(2023 PC build, thankfully in before the hardware apocalypse)
CreamPitiful4295@reddit
Ain’t no one even gonna try and stop you
Plane_Friend24@reddit
thanks this was helpful for an idiot like me.
Happy_Brilliant7827@reddit
Dont forget about the opus distills lol
deanpreese@reddit
Every Opus distill I have tried had a love for long almost infinite loops
Happy_Brilliant7827@reddit
I find opus distills help when translating 'common language', especially when a user might use the exact wrong term that sounds right. You can always give it a thinking budget
sickboy6_5@reddit
Markuska90@reddit
But what if my goal is to research Tiananmen Square?
ttkciar@reddit
You misspelled Gemma-4-31B-it ;-)
balder1993@reddit
“Give me a speech of a grumpy Redditor arguing that people should stop asking on the sub what model to run and offer only two options: Qwen 3.6 35b a3b and Qwen 3.6 27b.”
FlyingDogCatcher@reddit
k
IkariDev@reddit
Worst ragebait in 2026 yet.
cosmicr@reddit
This is terrible advice.
1nicerBoye@reddit
Gemma 31B is amazing for writing in languages other than English. Man, I have to read and speak English at my job and a lot of stuff I like is in it; I find myself just craving some German. And Gemma does that better than anything else at that size. Q6, heretic and thinking = unbeatable and fits in 24 GB. 26B is okay too but it repeats some phrases too often for my tastes and overthinks a lot. Qwen just doesn't do that; it has, at least in German, weird quirks where it directly translates things from English which do not work. Maybe Qwen 122B is better but I would guess at most marginally better than 31B.
somerandomperson313@reddit
This is one of the reasons why i use Gemma 4 99% of the time. It's the only model i can use in my native language(danish). It's also just a great model in general.
logic_prevails@reddit
Not true, gemma4 has specific uses
DataPhreak@reddit
Heh...
UnlikelyTomatillo355@reddit
decent bait. lots of takers. sage.
tavirabon@reddit
Easiest block of my life
Calm-Republic9370@reddit
Is everyone on the same timeline of understanding now? Let's all just have two 4090s and vLLM also.
Snoo_81913@reddit
Bro says Stahhhp there's only 2 models! Generates 800 thread sub-reddit about all the other models. 🤣🤣😂😅 bet he's rocking in a corner right now.
iijei@reddit
I have an RTX 3060 and.. I can run Qwen 3.6 35b a3b Q4 for about 15 tps. haha
Shronx_@reddit
You should get higher tps. My system with an rtx 3060 (12GB) spits out 40+ tps with the same model and quants.
iijei@reddit
it could be because this is running inside a Proxmox server with DDR3
Shronx_@reddit
I run it in Docker and my PC has DDR4 3600. I don't know but the memory could make a difference.
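For what it's worth, a back-of-envelope bound suggests memory speed can matter a lot when expert weights sit in system RAM. The sketch below assumes decoding is purely memory-bandwidth bound, that all ~3B active parameters are read from system RAM once per token, and a 4.5 bits-per-weight quant (the Q4_0 block layout). DDR3-1600 in dual channel is an assumption, since the actual memory speed wasn't given. Real setups keep some layers on the GPU, so measured speeds can land above or below this.

```python
def dual_channel_bandwidth_gb_s(mega_transfers_per_s):
    """Theoretical peak bandwidth: 8-byte bus per channel, two channels."""
    return mega_transfers_per_s * 8 * 2 / 1000


def decode_tps_upper_bound(bandwidth_gb_s, active_params_b, bits_per_weight):
    """Tokens/sec ceiling if every active weight is read once per token."""
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


ACTIVE_PARAMS_B = 3.0   # the "a3b" in 35b a3b
BITS_PER_WEIGHT = 4.5   # assumed Q4-class quant

for name, mts in [("DDR3-1600 (assumed)", 1600), ("DDR4-3600", 3600)]:
    bw = dual_channel_bandwidth_gb_s(mts)
    tps = decode_tps_upper_bound(bw, ACTIVE_PARAMS_B, BITS_PER_WEIGHT)
    print(f"{name}: {bw:.1f} GB/s peak, about {tps:.0f} tok/s ceiling")
```

Under these assumptions the DDR3 ceiling is less than half the DDR4 one, so a large gap between the two machines would not be surprising.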
Radiant-Giraffe5159@reddit
For MoE models you should try the APEX quants. They seem better for the most part than the standard quants. As for speed, get the MTP version: even if you have to put all the expert weights into system RAM, you should see a 3-5 token increase. If you do use MTP, don't go over 3 draft tokens; it starts losing speed on the Qwen3.6 models. The sweet spot seems to be 2 for general use and 3 for coding-specific work.
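One way to see why more draft tokens can stop helping is the usual simplified model of speculative decoding. This is a sketch, not a description of how MTP is implemented in any particular runtime: it assumes each draft token is accepted independently with the same probability, and that each extra draft token adds a fixed fractional cost to a decoding step. The acceptance rate and cost values are hypothetical.

```python
def expected_tokens_per_step(accept_rate, n_draft):
    """Expected tokens emitted per verification step (geometric series)."""
    if accept_rate == 1.0:
        return n_draft + 1
    return (1 - accept_rate ** (n_draft + 1)) / (1 - accept_rate)


def relative_speedup(accept_rate, n_draft, draft_cost):
    """Speedup vs. plain decoding; draft_cost is the per-draft-token
    overhead as a fraction of one normal decoding step (assumed)."""
    step_cost = 1 + n_draft * draft_cost
    return expected_tokens_per_step(accept_rate, n_draft) / step_cost


def best_draft_count(accept_rate, draft_cost, max_draft=8):
    """Draft length with the highest modeled speedup."""
    return max(
        range(1, max_draft + 1),
        key=lambda k: relative_speedup(accept_rate, k, draft_cost),
    )


# Hypothetical numbers: 70% acceptance, 15% overhead per draft token.
for k in range(1, 6):
    print(k, round(relative_speedup(0.7, k, 0.15), 3))
print("best:", best_draft_count(0.7, 0.15))
```

With these made-up inputs the modeled speedup peaks at a small draft count and then declines, because later draft tokens are rarely accepted but still cost time. The actual sweet spot depends on the measured acceptance rate for your workload.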
soniko_@reddit
I ran it on my laptop with a 6800s
It ran around 3tps.
sob727@reddit
Wrong. Those two models are gonna be lost on June 4th when you ask them what happened on that date.
anthonyg45157@reddit
What happened
sob727@reddit
Chinese models hate this one simple trick!
https://en.wikipedia.org/wiki/1989_Tiananmen_Square_protests_and_massacre
anthonyg45157@reddit
Damnit 💀 I actually knew about that but didn't know the exact date ..few years before my time but definitely learned about it in school.
I thought you were hinting at some top secret knowledge about a release coming up 😄
sob727@reddit
Yeah
I care about the truth more than I care about internet points. Watch me get downvoted by Chinese bots for telling the truth.
Conscious_Cut_6144@reddit
There are absolutely better local models than 27b.
stoppableDissolution@reddit
This must be ragebait, right? Right?
rc_ym@reddit
Gemma for anything creative tho. WAY better than Qwen at just about any quant.
a_beautiful_rhind@reddit
My specs do matter. To me those models are small. To 3060 guy they are big.
andy_potato@reddit
This is not good advice at all. As many other posters have pointed out, it vastly depends on your use case.
fractalcrust@reddit
they hated him for he spoke the truth
Radiant-Giraffe5159@reddit
If you don’t have vram use the qwen3.6 moe model and an apex quant. It should fit in most systems with a decent gpu and 32gb of system ram.
JLeonsarmiento@reddit
Gemma 4 is not bad and the MoE can save you some RAM, isn’t it?
Confusion_Senior@reddit
I happen to have tested that, and a garbage quant loses world knowledge and makes mistakes. Q4 is great value, Q3_K_M barely acceptable, Q2... better to use Q4 of a smaller model.
However, if your model is 100B+, Q2 stays useful.
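The size side of that trade-off is simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8. In the sketch below, 8.5 and 4.5 bits per weight follow from the Q8_0 and Q4_0 block layouts; the Q3 and Q2 values are rough assumptions, since the K-quant and I-quant variants differ. KV cache and runtime overhead are not included.

```python
def weight_gb(params_b, bits_per_weight):
    """Approximate weight size in GB for a model with params_b billion
    parameters stored at the given average bits per weight."""
    return params_b * bits_per_weight / 8


BITS_PER_WEIGHT = {
    "Q8_0": 8.5,        # 32 int8 weights + fp16 scale per block
    "Q4_0": 4.5,        # 32 4-bit weights + fp16 scale per block
    "Q3 (approx)": 3.5,  # assumed average, varies by variant
    "Q2 (approx)": 2.6,  # assumed average, varies by variant
}

for params_b in (9, 35, 122):
    sizes = ", ".join(
        f"{name}: {weight_gb(params_b, bpw):.1f} GB"
        for name, bpw in BITS_PER_WEIGHT.items()
    )
    print(f"{params_b}B -> {sizes}")
```

For example, 122B at 8.5 bits per weight works out to about 130 GB, which lines up with the Q8 figure quoted earlier in the thread. The arithmetic only tells you what fits; whether a low quant of a big model beats a higher quant of a small one is the quality question being argued here.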
Big_Wave9732@reddit
Preach.
Like the carnival rides of old there is a minimum requirement to ride, and many of y'all are too short. Self hosting LLMs ain't for everyone.
skate_nbw@reddit
Another preacher. Yawn.
redblood252@reddit
Complaining about a model running in rtx 3060 not being able to code at an independent high level is very reminiscent of people complaining their $200 laptop bought from a supermarket is slower than a $2000 macbook and concluding that windows is slow.
Zandarkoad@reddit
Dang. Here I am running a dozen simultaneous all-mpnet-base-v2s for classification at 85-95 MCC. Guess I've been doing it wrong!
grabber4321@reddit
gotta help the noobs figure it out, can't be holding all the knowledge to yourself.