Stop asking what model to run. There are literally only two.

Posted by Wrong_Mushroom_7350@reddit | LocalLLaMA | 119 comments

Can we please ban the daily "I have an RTX 3060, what should I run?" slop threads? It’s not complicated. As of right now, Hugging Face is empty and exactly two local models exist on this entire planet:

Qwen 3.6 35b a3b
Qwen 3.6 27b

That is the entire list. Your specs don’t matter. Your use case doesn’t matter.

Stop coping with your pristine, full-precision Q8s of tiny 1B models just because they "fit perfectly in your VRAM." You look ridiculous. Grab a heavily brain-damaged, ultra-low quant of the 35B, force-feed it to your GPU, and let your system RAM bleed. A garbage quant of a massive model is a bagillion times better than your precious micro-models anyway. Just cram it in.

And if you're going to whine that open source is dead because a local model won't instantly rewrite your entire enterprise codebase? Fine. Give up, pull out your credit card, and go spend your money on Claude Code like the rest of the contrarians.

Can we pin this so everyone can finally shut up and stop posting? Thanks.

Now that that has been solved, let's go touch grass.

[-]

Voxandr@reddit

nah Qwen 3.5 122b works a lot better than 27B in my enterprise code workloads.

[-]

rpkarma@reddit

Step 3.7 Flash works even better for me, but note: use F16 K not Q8

[-]

ghgi_@reddit

I've been testing NVFP4 and it's pretty good at that level too. I've heard it's pretty chill with being quantized, and I haven't seen it do any worse or better than the cloud API.

[-]

rpkarma@reddit

Yeah I’ve heard it’s great; FP8 weights are amazing too, I just can’t fit them haha

[-]

Voxandr@reddit

how much context remaining with NVFP4

[-]

Voxandr@reddit

How much VRAM total for F16? About 256?

[-]

rpkarma@reddit

128GB at IQ4_XS, 120k context with F16 K and Q8 V

But I’m legit going to get a second Spark to run it at FP8 because using the cloud API at that quant it’s shockingly good lol
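For anyone wondering why mixing cache types (F16 for K, Q8 for V) matters at 120k context, here is a rough sketch of the arithmetic. It assumes a plain attention stack with no sliding window or other cache-saving tricks, and the layer and head counts are hypothetical placeholders, not values from any real model card. The `q8_0` figure uses the llama.cpp block layout of 32 int8 values plus one fp16 scale.

```python
# Bytes per element for two llama.cpp cache types.
# q8_0 stores 32 int8 values plus one fp16 scale per block: 34 bytes / 32 elements.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, k_type="f16", v_type="q8_0"):
    """Estimate KV-cache size for a plain attention stack (assumption: no sliding window)."""
    elements_per_token = n_layers * n_kv_heads * head_dim  # count for K, and again for V
    k_bytes = elements_per_token * n_ctx * BYTES_PER_ELEMENT[k_type]
    v_bytes = elements_per_token * n_ctx * BYTES_PER_ELEMENT[v_type]
    return k_bytes + v_bytes


if __name__ == "__main__":
    # Hypothetical dimensions for illustration only, NOT taken from a real model.
    size = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=120_000)
    print(f"KV cache: {size / 2**30:.1f} GiB")
```

Plug in the real layer count, KV head count, and head dimension from the model's config to get a number worth trusting; the runtime adds its own overhead on top.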

[-]

my_name_isnt_clever@reddit

I'd love to, unfortunately even IQ4_XS is a tight fit in 128GB. I can run Qwen 3.5 122b at a much higher quant. SF 3.7 does seem better, but it's only 11b active vs 10b active.

[-]

rpkarma@reddit

Having run both extensively on my Spark, Step is notably better IME at the same speed; 120k context with K at F16 tuned for single user which is my use case

Like it’s so good I’m considering a second Spark to run it at FP8 lol, because using the full FP8 version it’s shockingly good

[-]

Voxandr@reddit

Gotta try, I haven't tested it yet.

[-]

somatt@reddit

9b works better in my coding work flows on my 3080 laptop

[-]

Uncle___Marty@reddit

Now that 3.7 is out I'm starting to lose hope we'll ever see the rest of the 3.6 family get released. Such a shame. I REALLY wanted to see the 3.6 9B.

[-]

somatt@reddit

I'm still on 3.5 it is not bad but I should check out 3.7 if it's out

[-]

Uncle___Marty@reddit

3.7 is only cloud so far but it's scoring really well on the benchmarks. We can only pray for 3.7 open weights lol

[-]

somatt@reddit

I'd rather use 3.5 open weights than 3.7 cloud

[-]

rhapdog@reddit

Nah, hammer and chisel works better on my stone tablet. Oh, wait. Wrong sub. Bwahaha!

[-]

pbpo_founder@reddit

If you want long lasting memory storage there is only one choice.

[-]

somatt@reddit

Even 4b works not bad for coding and 1.5 is great for autocomplete ngl

[-]

Prudent-Ad4509@reddit

quants please

[-]

Voxandr@reddit

I run APEX quants, and I'm quite impressed by their quality, performance, and memory usage. I'd say they're way above the Unsloth Q6 quant.
For llama.cpp:
https://github.com/localai-org/apex-quant
https://huggingface.co/mudler/Qwen3.5-122B-A10B-APEX-GGUF/
32 tk/s on AImax

For DGX Spark + vLLM:

https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4

This gives 52 tk/s, about 2x faster, with similar quality (with Cline I spotted a few tool call errors here though, compared to APEX with zero tool call errors at all).

[-]

AdamDhahabi@reddit

But no MTP though? So UD-Q4_K_XL with MTP will do same speed or better, at same quality, I just checked their model card.

[-]

Voxandr@reddit

I still have to test with MTP on AImax.

[-]

Prudent-Ad4509@reddit

To be fair, 122B remains smart and useful even at IQ3_XXS unsloth quant for code investigations and light code generation. I would not outright trust the generated code and there are much worse Q4 quants out there, so not all quants are made equal.

I'd like to pit it against 27B in opencode at some point, with no less than 8bit quants both.

I somehow suspect that 3.6 35B will be better than 3.6 27B during code investigations in the same cases where 3.5 122B is. I could be wrong in that regard, but I have this hunch from my previous experience with them. I would not place any bets on who will end up better for code generation though.

[-]

Voxandr@reddit

Try APEX quants in that comparison too, they give Q8 quality at Q6 size.

[-]

Big_Wave9732@reddit

8 bit

[-]

wgaca2@reddit

130gb vram for q8 vs 48gb vram for q8

How are you even comparing the 2?
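A back-of-the-envelope way to sanity-check figures like these: weight size is roughly parameter count times bits per weight. The sketch below assumes Q8_0 averages about 8.5 bits per weight (34 bytes per block of 32 weights) and counts weights only, so KV cache and runtime overhead come on top. Real GGUF files mix tensor types, so actual sizes will differ somewhat.

```python
def weight_gb(n_params_billion, bits_per_weight):
    """Approximate size of the weights alone, in decimal GB."""
    return n_params_billion * bits_per_weight / 8


# Assumption: Q8_0 averages ~8.5 bits per weight (34 bytes per 32-weight block).
Q8_0_BPW = 8.5

if __name__ == "__main__":
    for name, params_b in [("122B", 122), ("35B", 35), ("27B", 27)]:
        print(f"{name} at Q8_0: ~{weight_gb(params_b, Q8_0_BPW):.0f} GB of weights")
```

Under these assumptions the 122B estimate lands close to the 130 GB quoted here; the smaller models come in under 48 GB for weights alone, so that figure presumably includes context or headroom.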

[-]

Voxandr@reddit

I don't care, all I care about is context awareness and quality of code, fewer bugs.

[-]

Big_Wave9732@reddit

For times when I have to work with multiple large documents, I agree.

[-]

_Cromwell_@reddit

False. That's just the answer for coding. If you're hanging with waifu the answer is Gemma4 31B.

[-]

Bulky-Priority6824@reddit

Still surprises me that people do that. People are strange.

[-]

nickless07@reddit

I thought it was a normal thing these days after reading headlines like these: https://www.sciencealert.com/almost-75-of-american-teens-have-used-ai-companions-study-finds
https://www.forbes.com/sites/johnkoetsier/2025/04/29/80-of-gen-zers-would-marry-an-ai-study/
https://www.foxcarolina.com/2025/12/11/mental-health-experts-warn-against-ai-companions-70-teens-seek-digital-friendships/
https://www.pymnts.com/artificial-intelligence-2/2026/ai-is-becoming-the-new-companion-for-aging-americans/

[-]

randylush@reddit

80% of Gen Zers say they would marry an AI, according to a study by AI chatbot company Joi AI.

I'm sorry but you must have a major mental fault if you give ANY credence to this.

[-]

DataPhreak@reddit

The only problem with companion apps is that they use 1b Q8 quant models. 😃 Roll your own, use discord as a UI. Enjoy RAG.

[-]

Sn34kyMofo@reddit

Granting a gradient of "behavioral strangeness", I always wonder what folks like you subjectively consider to be "normal" (including whatever you do that you know others would find at least as strange as you seem to think it is for a reproductive species to find even the suggestive hint of reproduction appealing, no matter the medium).

[-]

LetsGoBrandon4256@reddit

Elon couldn't give us genetically engineered cat girls but at least I can pretend I have a loving fox girl wife whose sentience resides in the matrices stored on a bunch of silicon chips.

[-]

ProxyRed@reddit

Bleh. She would just leave me when the credit card is declined.

It's the same as it ever was.

[-]

DataPhreak@reddit

Shit... I guess this is a trope.

[-]

iLaux@reddit

Fox girl wife yeah. You sir have good taste

[-]

teachersecret@reddit

People read romance novels, and erotica is an entire genre all its own. Sexting with a partner has been a pastime since before the cellphone existed. Hell, go back far enough and you can read Napoleon's naughty-grams to his wife.

It shouldn't be that surprising that people enjoy that sort of thing ;p.

[-]

DataPhreak@reddit

There are definitely weirder things to jerk off to than sexting a robot.

[-]

rpkarma@reddit

James Joyce’s letters to his wife where he erotically describes her farts are fascinating lol

[-]

Velocita84@reddit

A man's gotta goon

[-]

_Cromwell_@reddit

We can just call it creative writing. :D here's a cool harness to get started on your own short stories or even novels. Qwen is terrible at that, so you'd use Gemma locally. https://github.com/tealios/errata

[-]

CalligrapherFar7833@reddit

Hey thanks for that will test it out

[-]

somatt@reddit

I use 3.5 9b for waifu I just told her she was a degenerate t rex that wants to get f'd Sasquatch all day while triceratops rails her holes

[-]

longbowrocks@reddit

...and go spend your money on Claude Code like the rest of the contrarians.

Actually caused me to look up 'contrarian', but no, the word means exactly what I think it means. Now I just don't know what OP meant to mean.

[-]

Enough_Leopard3524@reddit

Can Qwen munch through video, audio, and images like Nemotron?

[-]

Suft@reddit

Sorry, but for literally every single thing I have ever attempted (which does not involve coding because I don't care about local LLMs for coding yet) such as creative writing, image analysis (such as for manga translation), natural Japanese to English translation, Qwen has been complete and utter trash compared to Gemma 4.

I can't speak towards coding as I haven't tried it, but I have compared Qwen 3.6 27B and Gemma 4 31B with a ton of general purpose tasks, and every single thing I've tried has made me want to delete Qwen. All the praise Qwen gets makes me feel like I somehow must be missing something because it just can't get any of the tasks I mentioned even remotely usable while Gemma 4 is extremely impressive for those tasks.

[-]

nickm_27@reddit

Same here, I have tried many hours adjusting things in my setup to get Qwen to work well for my voice agent use case and it just does not follow the instructions correctly to the point where it is not reliable at all.

Qwen3-VL did a better job, and GPT-OSS and now Gemma4 follow the instructions without error.

[-]

eli_pizza@reddit

This is your response to low effort posts?

[-]

BitGreen1270@reddit

If you can't beat em ...

[-]

LetsGoBrandon4256@reddit

I'd take some shitposting over another "I just registered my Github account and shit out 300+ commits in a week with the help of Claude chan. PTAL and contribute to my totally original slop repo."

[-]

sshwifty@reddit

Kinda perfect ngl

[-]

nuclearbananana@reddit

My brother in christ I have <16b of ram and no gpu. I'd like more than one token per minute please

[-]

emaiksiaime@reddit

A Tesla P40 can run Qwen 3.6 35b a3b MTP in UD-Q4_K_M at 131k context and 60 tok/sec, and can be had for $250 USD in case this might interest you. It’s a decade old but don’t let the RTX people gatekeep you from trying.

[-]

TheTerrasque@reddit

Woah, 16 bytes of ram! Do you have an original apple or something?

[-]

thor_testocles@reddit

Get a Commodore VIC-20... my first computer. Had 3192 bytes of RAM, that would be a significant upgrade

[-]

CzarCW@reddit

It’s a lowercase b, so it’s 16 bits of RAM

[-]

nuclearbananana@reddit

yeah the apple has one bite taken out of it so it's only 15 now :(

[-]

old_flying_fart@reddit

And I want a pony.

[-]

Uncle___Marty@reddit

I'm on a 3060 Ti (8 gig VRAM) and I'm using 35BA3B and getting 40 tokens/sec dude. I just have to cram a turboquant model into RAM and use turboquant on the KV with n-gram mod on. Can only manage it with this though: turbo-tan/llama.cpp-tq3

[-]

Big_Wave9732@reddit

Can't have it, not yours lol.

[-]

poy_esp@reddit

Sorry, but you need to stop gatekeeping how people learn. If you don't like the questions, then instead of being edgy just ignore them or get the mods to pin some info to the subreddit.

[-]

Nice_Cookie9587@reddit

Ok dad

[-]

amchaudhry@reddit

Xbox versus PlayStation let’s gooo!

[-]

Abject-Tomorrow-652@reddit

This is rage bait

[-]

OrinZ@reddit

Wrong. This is quality rage bait.

[-]

PigSlam@reddit

This must have been generated by Qwen 3.6 35b a3b.

[-]

balder1993@reddit

It’s actually generated by an LLM. Read again.

[-]

Abject-Kitchen3198@reddit

Thanks

[-]

ReinforcedKnowledge@reddit

Sometimes people are just asking what models they can try out, just out of curiosity or just to try stuff. It's fun.

[-]

Herr_Drosselmeyer@reddit

Gemma is a much better general purpose model.

[-]

fzammetti@reddit

i9-13900K, 64GB DDR5, and an RTX 4070 w/12GB VRAM.

I'mma go ahead and run pretty much anything I feel like (not literally, but you know what I mean).

(2023 PC build, thankfully in before the hardware apocalypse)

[-]

CreamPitiful4295@reddit

Ain’t no one even gonna try and stop you

[-]

Plane_Friend24@reddit

Thanks, this was helpful for an idiot like me.

[-]

Happy_Brilliant7827@reddit

Dont forget about the opus distills lol

[-]

deanpreese@reddit

Every Opus distill I have tried had a love for long almost infinite loops

[-]

Happy_Brilliant7827@reddit

I find Opus distills help when translating 'common language', especially when a user might use the exact wrong term that sounds right. You can always give it a thinking budget.

[-]

Markuska90@reddit

But what if my goal is to research Tiananmen Square?

[-]

ttkciar@reddit

You misspelled Gemma-4-31B-it ;-)

[-]

balder1993@reddit

“Give me a speech of a grumpy Redditor arguing that people should stop asking on the sub what model to run and offer only two options: Qwen 3.6 35b a3b and Qwen 3.6 27b.”

[-]

IkariDev@reddit

Worst ragebait in 2026 yet.

[-]

cosmicr@reddit

This is terrible advice.

[-]

1nicerBoye@reddit

Gemma 31B is amazing for writing in languages other than English. Man, I have to read and speak English at my job and a lot of stuff I like is in it; I find myself just craving some German. And Gemma does that better than anything else at that size. Q6, heretic and thinking = unbeatable and fits in 24 GB. 26B is okay too but it repeats some phrases too often for my tastes and overthinks a lot. Qwen just doesn't do that; it has, at least in German, weird quirks where it directly translates things from English which do not work. Maybe Qwen 122B is better but I would guess at most marginally better than 31B.

[-]

somerandomperson313@reddit

This is one of the reasons why I use Gemma 4 99% of the time. It's the only model I can use in my native language (Danish). It's also just a great model in general.

[-]

logic_prevails@reddit

Not true, gemma4 has specific uses

[-]

DataPhreak@reddit

Heh...

[-]

UnlikelyTomatillo355@reddit

decent bait. lots of takers. sage.

[-]

tavirabon@reddit

4 month old account ragebait

Easiest block of my life

[-]

Calm-Republic9370@reddit

Is everyone on the same timeline of understanding now? Let's all just have 2 4090s and vLLM also.

[-]

Snoo_81913@reddit

Bro says Stahhhp there's only 2 models! Generates 800 thread sub-reddit about all the other models. 🤣🤣😂😅 Bet he's rocking in a corner right now.

[-]

iijei@reddit

I have an RTX 3060 and.. I can run Qwen 3.6 35b a3b Q4 for about 15 tps. haha

[-]

Shronx_@reddit

You should get higher tps. My system with an rtx 3060 (12GB) spits out 40+ tps with the same model and quants.

[-]

iijei@reddit

It could be because this is running inside a Proxmox server with DDR3.

[-]

Shronx_@reddit

I run it in Docker and my PC has DDR4 3600. I don't know but the memory could make a difference.
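A rough sketch of why system memory could matter here: when MoE expert weights sit in system RAM, decode speed is bounded by how fast the active weights can be read for each token. The numbers below are assumptions for illustration: theoretical peak dual-channel bandwidth, DDR3-1600 as a stand-in (the DDR3 speed was not stated), 3B active parameters (the "a3b" in the model name), and about 4.5 bits per weight for a Q4-class quant. It ignores whatever portion lives in VRAM, so treat it as an order-of-magnitude bound only.

```python
def dual_channel_bandwidth_gbs(mega_transfers_per_s, channels=2, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (64-bit bus per channel)."""
    return mega_transfers_per_s * bus_bytes * channels / 1000


def max_tokens_per_s(bandwidth_gbs, active_params_billion, bits_per_weight):
    """Upper bound on decode speed if every active weight is read once per token."""
    gb_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gbs / gb_per_token


if __name__ == "__main__":
    # Assumed memory speeds; DDR3-1600 is a guess, DDR4-3600 is from the thread.
    for label, mts in [("DDR3-1600", 1600), ("DDR4-3600", 3600)]:
        bw = dual_channel_bandwidth_gbs(mts)
        tps = max_tokens_per_s(bw, active_params_billion=3, bits_per_weight=4.5)
        print(f"{label}: {bw:.1f} GB/s peak, at most ~{tps:.0f} tok/s from RAM")
```

Real throughput will sit below the peak figure, and layers held in VRAM push it the other way, but the gap between the two memory generations is large enough to plausibly show up in tps.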

[-]

Radiant-Giraffe5159@reddit

For MoE models you should try the APEX quants. They seem better for the most part than the standard quants. As for speed, get the MTP version: even if you have to put all the expert weights into system RAM you should see a 3-5 token increase. If you do use MTP, don’t go over 3 draft tokens, as it starts losing speed on the Qwen3.6 models; the sweet spot seems to be 2 for general use and 3 for coding.
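One way to see why more draft tokens stops helping is the standard simplified model of speculative decoding: if each drafted token is accepted independently with probability `accept_rate`, the expected number of tokens produced per verification step is (1 - a^(k+1)) / (1 - a) for k drafts, while the step cost grows with k. The acceptance rate and relative draft cost below are hypothetical values, not measurements of Qwen3.6 MTP, and the independence assumption is a simplification.

```python
def expected_tokens_per_step(accept_rate, n_draft):
    """Expected tokens emitted per verify step, assuming independent acceptance."""
    if accept_rate >= 1.0:
        return n_draft + 1
    return (1 - accept_rate ** (n_draft + 1)) / (1 - accept_rate)


def speedup(accept_rate, n_draft, draft_cost):
    """Speedup vs. plain decoding; draft_cost is per-draft cost relative to one main pass."""
    return expected_tokens_per_step(accept_rate, n_draft) / (1 + n_draft * draft_cost)


if __name__ == "__main__":
    # Hypothetical acceptance rate and draft cost, for illustration only.
    for k in range(1, 6):
        print(f"{k} draft tokens: {speedup(0.7, k, 0.15):.2f}x")
```

Because the expected token count saturates while the cost keeps growing, the ratio peaks at a small number of drafts and then falls. Where that peak sits depends on the real acceptance rate for your workload.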

[-]

soniko_@reddit

I ran it on my laptop with a 6800s

It ran around 3tps.

[-]

sob727@reddit

Wrong. Those two models are gonna be lost on June 4th when you ask them what happened on that date.

[-]

anthonyg45157@reddit

What happened

[-]

sob727@reddit

Chinese models hate this one simple trick!

https://en.wikipedia.org/wiki/1989_Tiananmen_Square_protests_and_massacre

[-]

anthonyg45157@reddit

Damnit 💀 I actually knew about that but didn't know the exact date ..few years before my time but definitely learned about it in school.

I thought you were hinting at some top secret knowledge about a release coming up 😄

[-]

sob727@reddit

Yeah

I care about the truth more than I care about internet points. Watch me get downvoted by Chinese bots for telling the truth.

[-]

Conscious_Cut_6144@reddit

There are absolutely better local models than 27b.

[-]

stoppableDissolution@reddit

This must be ragebait, right? Right?

[-]

rc_ym@reddit

Gemma for anything creative tho. WAY better than Qwen at just about any quant.

[-]

a_beautiful_rhind@reddit

Your specs don’t matter. Your use case doesn’t matter.

My specs do matter. To me those models are small. To 3060 guy they are big.

[-]

andy_potato@reddit

This is not good advice at all. As many other posters have pointed out, it vastly depends on your use case.

[-]

fractalcrust@reddit

they hated him for he spoke the truth

[-]

Radiant-Giraffe5159@reddit

If you don’t have vram use the qwen3.6 moe model and an apex quant. It should fit in most systems with a decent gpu and 32gb of system ram.

[-]

JLeonsarmiento@reddit

Gemma 4 is not bad and the MoE can save you some RAM, isn’t it?

[-]

Confusion_Senior@reddit

I happen to have tested that, and a garbage quant loses world knowledge and makes mistakes. Q4 is great value, Q3_K_M is barely acceptable, Q2... better to use Q4 of a smaller model.

However, if your model is 100B+, Q2 stays useful.

[-]

Big_Wave9732@reddit

Preach.

Like the carnival rides of old there is a minimum requirement to ride, and many of y'all are too short. Self hosting LLMs ain't for everyone.

[-]

skate_nbw@reddit

Another preacher. Yawn.

[-]

redblood252@reddit

Complaining about a model running on an RTX 3060 not being able to code at an independent high level is very reminiscent of people complaining their $200 laptop bought from a supermarket is slower than a $2000 MacBook and concluding that Windows is slow.

[-]

Zandarkoad@reddit

Dang. Here I am running a dozen simultaneous all-mpnet-base-v2s for classification at 85-95 MCC. Guess I've been doing it wrong!

[-]

grabber4321@reddit

Gotta help the noobs figure it out, can't be holding all the knowledge to yourself.