Which one are you waiting for more: 9B or 35B?

Have you tried Qwen3-Next-80B-A3B or Qwen3-Coder-Next-80B-A3B with linear attention? The whole model fits in 24GB VRAM, and runs at ~50 t/s on my PC, so even if part of it spills over into RAM, it will probably still be fast enough to be very usable.

cristoper@reddit

> The whole model fits in 24GB VRAM

How do you fit an 80B model in 24GB? What quant do you run?

Xantrk@reddit

I'm also running the Unsloth IQ3_S quant on a 12GB VRAM + 32GB RAM combo at 30 tk/s. MoE models are wild for us GPU poors.

huseynli@reddit

I am new to local LLMs; I started playing with them today. I haven't figured out fine-tuning and the intricacies yet (thinking vs non-thinking, A3B meaning 3B active despite 35B total, and so on). My current environment is llama.cpp + Open WebUI on an AMD 7700 XT (12GB VRAM) + 32GB DDR5 RAM. Which models would you say are the best ones to try and experiment with? Qwen3.5 9B, Qwen3.5 35B-A3B, GLM, gpt-oss, etc.? Sorry for asking for info like this; I just saw that we have similar hardware. Thank you.

social_tech_10@reddit

Unsloth [MXFP4_MOE](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/resolve/main/Qwen3-Coder-Next-MXFP4_MOE.gguf?download=true)

cristoper@reddit

Thanks. I'm going to try an MXFP4 quant and see if it works better than the Q4_K_S from Unsloth. But it's still a 4-bit quantization, so it will need about 40GB for the weights alone... definitely won't fit in 24GB of VRAM. I'm impressed you're getting 50 t/s on your PC. What is your hardware?
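
A rough way to check the 40GB figure: weight memory is about parameter count × bits per weight ÷ 8. The sketch below is a back-of-the-envelope estimate, assuming a single uniform bit width for every tensor (real GGUF quants such as Q4_K_S or MXFP4_MOE mix tensor types, so actual file sizes differ somewhat) and using 4.25 bits/weight as an approximation for MXFP4 (4-bit values plus one shared 8-bit scale per block of 32). It counts weights only, not KV cache or runtime buffers.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes).

    Assumes every tensor uses the same bits per weight; ignores KV cache,
    activations, and file/runtime overhead.
    """
    bytes_per_weight = bits_per_weight / 8
    # (params_billion * 1e9 params) * bytes_per_weight / 1e9 bytes per GB
    return params_billion * bytes_per_weight


if __name__ == "__main__":
    for bits in (3, 4, 4.25):  # 4.25 ~ MXFP4 with its per-block scale
        gb = weight_memory_gb(80, bits)
        print(f"80B at {bits} bits/weight: ~{gb:.1f} GB, fits in 24 GB: {gb <= 24}")
```

Even at 3 bits/weight an 80B model comes to about 30GB, so on a 24GB card part of it has to live in system RAM.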

agoldin@reddit

I have no idea what hardware he has, but launch Claude Code, Codex, or Qwen with a free plan in your llama.cpp directory, tell it to read this discussion (just give it the URL) and the Unsloth page for the model (like [https://unsloth.ai/docs/models/qwen3-coder-next](https://unsloth.ai/docs/models/qwen3-coder-next)), allow it to search the web, and discuss how best to configure and compile your version of llama.cpp, then run ./build/bin/llama-bench on your particular hardware. Just use LLMs via API to bootstrap your local LLM.

techmago@reddit

> How do you fit an 80b model in 24GB? what quant do you run?

Firm_Meeting6350@reddit

Qwen3-Next-80B-A3B is amazing. Still, I wonder how a potential Qwen3.5-Next-80B would perform :D

Far-Low-4705@reddit

70b?? Why not 80b

Kat-@reddit

Oh, yeah, that too.

-InformalBanana-@reddit

Hahaha

Significant_Fig_7581@reddit

Honestly both! A 60B model would also be 🔥🔥🔥

Iory1998@reddit

Why not 80B? It works well on a lot of consumer hardware.

Fresh_Finance9065@reddit

Something for 8GB VRAM + 32GB RAM systems would be nice. 60B is a nice size for that.

Beautiful_Egg6188@reddit

HOW?! I'm struggling to run a 27B at Q4_K_M with 32k context on 12GB VRAM + 64GB RAM. It's slow, 5 t/s, and keeps getting slower over time.

Fresh_Finance9065@reddit

Use MoE models of around 70B parameters and offload the weights to the CPU. I'm assuming you are using Qwen3.5 27B, which is dense. Dense models start slow and slow down faster.
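
A back-of-the-envelope way to see why a sparse MoE tolerates CPU offload better than a dense 27B: token generation is usually limited by memory bandwidth, and each new token has to read every *active* weight once. The sketch below turns that into an upper bound on tokens/s. The 60 GB/s RAM bandwidth and 4.8 bits/weight are hypothetical placeholders, not measurements, and the estimate ignores KV-cache reads (which grow with context and are one reason speed drops as a chat gets longer), compute limits, and how layers are split between GPU and CPU.

```python
def max_decode_tokens_per_s(active_params_billion: float,
                            bits_per_weight: float,
                            bandwidth_gb_s: float) -> float:
    """Crude upper bound on generation speed for a bandwidth-bound decoder.

    Every generated token streams all *active* weights from memory once, so
    tokens/s <= bandwidth / GB read per token. KV-cache traffic is ignored.
    """
    gb_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token


if __name__ == "__main__":
    RAM_BANDWIDTH_GB_S = 60.0  # hypothetical system-RAM bandwidth
    BITS = 4.8                 # hypothetical effective bits/weight for a ~Q4 quant
    dense = max_decode_tokens_per_s(27, BITS, RAM_BANDWIDTH_GB_S)  # all 27B active
    moe = max_decode_tokens_per_s(3, BITS, RAM_BANDWIDTH_GB_S)     # ~3B active (A3B)
    print(f"dense 27B: <= {dense:.1f} tok/s, MoE A3B: <= {moe:.1f} tok/s")
```

With those placeholder numbers the dense 27B tops out around 3.7 tok/s while an A3B MoE could reach roughly 33 tok/s from the same RAM; the MoE still needs enough total RAM + VRAM to hold all of its weights.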

Significant_Fig_7581@reddit

Oh, I use Qwen3-Coder-Next all the time, but I don't think they'll give us another 80B in the same month...

Iory1998@reddit

They might. The initial Qwen-3 Next was undertrained. I think a properly trained Next model will be a beast.

Significant_Fig_7581@reddit

If you mean the Qwen3-Next Instruct from last year, I think this one is already an update to that one. I still hope they give us another one, especially something around the 60B size, because I really want people with 16GB of RAM and 16GB of VRAM to experience how good Qwen can be even in low quants. Though I'm certain the 35B would be great too...

Iory1998@reddit

60B is also a good size, no doubt about it. But I think they might release two dense models first: a 9B and a 32B. I think with the hybrid attention, a Q4 of the 32B could fit in 24GB of VRAM with 127K context size.

Far-Low-4705@reddit

There are 2 confirmed models: a 9B dense and a 35B MoE. There might be a 2B dense too, but I forget. I'm really hoping for an 80B, though.

Iory1998@reddit

35B MoE sounds like a 30B-A3B with a 5B vision encoder! I am hoping for an 80B too.

Far-Low-4705@reddit

Yeah, that's what I was thinking, though 5B sounds very big for a vision encoder lol. I think this is the first natively multimodal model (and probably one of the first big-name local ones), so I'm hoping the vision is crazy good! Tbh, Qwen3-VL already has basically perfect vision (considering the sizes) imo, so I'm super excited for 3.5 since I use these models for engineering lol

Iory1998@reddit

I absolutely agree regarding the capabilities of Qwen-3-VL. No local model comes close to their performance.

AlwaysLateToThaParty@reddit

Yeah, as far as vision models go, it's my go-to.

Far-Low-4705@reddit

People with 16GB of RAM are very unlikely to have 16GB of VRAM... also, you wouldn't be able to fit any context. More likely it's 16GB RAM with 6-8GB VRAM, maybe 12. For those, an 8B dense or 30B MoE is best. I think 32GB RAM and 16GB VRAM is much more common, and that's where 80B is best.

finah1995@reddit

I have 16GB of RAM with about 6GB of VRAM, so I'm hoping for smaller models.

Significant_Fig_7581@reddit

I disagree; I've seen a lot of people with only 32GB of capacity when combining their RAM and VRAM...

Far-Low-4705@reddit

I think it’s more likely for an 80B model. But I’m reeeeally hoping we do get an 80B.

hesperaux@reddit

Same. I'm literally checking like 4 times a day. Finally being able to do coding and vision with one model is gonna unlock one of my gpus for other tasks. Qwen3.5-coder-80b would make me moist.

Far-Low-4705@reddit

They just released Qwen3.5 122B-A10B on Qwen Chat (no weights yet), so sadly I doubt there will be an 80B.

hesperaux@reddit

Yeah. 😞 They are on Hugging Face now too. I guess I'll try to make do with 122B at Q3. 😬 I really don't like going below Q4_K_M, and I was hyped to run Q6. Maybe they will release an 80B down the line, similar to Next.

Far-Low-4705@reddit

Yeah, that'd be super nice, but I doubt it... dang, that sucks, 80B would have been IDEAL for me, but those extra 42B parameters are just too much. Gonna be hard to replace the Qwen3-Next models for now, I guess.

Significant_Fig_7581@reddit

But didn't we just get an 80B? You're right, though, I also wanna see another one lol

Far-Low-4705@reddit

80B is a really great size; you can run it at full context with only 48GB. That’s basically a standard mid-tier PC (16GB GPU & 32GB VRAM).

journalofassociation@reddit

Somehow I'm able to run it in 36GB VRAM with 48k context (Q3). Not sure how it runs so efficiently.

shortfinal@reddit

16gb gpu with 32gb vram? What?

Far-Low-4705@reddit

Ah, sorry, typo: I meant 16GB VRAM with 32GB RAM.

IMightBeAlpharius@reddit

I know what you meant :D

jinnyjuice@reddit

Anything that fits in 100GB memory including 100k+ tokens!

insmek@reddit

Give me the best model that I can use on my 128GB MacBook or give me death.

Funny_Working_7490@reddit

Anything for m4 air 256 gb?

Daniel_H212@reddit

Minimax M2.5 must work quite well on that system.

MoffKalast@reddit

One death, coming right up.

zipzag@reddit

The Qwen Next 80Bs, which include Qwen Coder Next, are already here. 40GB to 80GB.

megacewl@reddit

Best we can do is Q4, 3B

CriticismNo3570@reddit

Waiting for R2, but don't expect that alone to affect the NASDAQ much.

Open-Raise-6676@reddit

For MoE models, I hope they have more total parameters but fewer activated parameters.

vhthc@reddit

I hope they do a 32B dense again.

ttkciar@reddit

Me too! But maybe someone will distill into Qwen3-32B?

vhthc@reddit

They released a 27b with impressive scores

ttkciar@reddit

Yup :-) I just downloaded it last night! It is very pleasing so far.

Traditional-Card6096@reddit

Would love to see a 9B run smoothly on an iPhone.

Material-Ad5426@reddit

Very curious about anything that can run on a more standard office laptop GPU 🙏

Tai9ch@reddit

I can't wait for the 85B. Right now I'm running both 30B-VL and 80B-Coder, and it'd be nicer if I could just run the big model for both.

hesperaux@reddit

Exactly

MinusKarma01@reddit

Honestly, a 1.7B one. It is crazy how good the Qwen3 version of it is. Can be run on CPU if you need cheap and can be run on GPU if you need fast. Can correctly summarize, extract and classify multilingual texts (EU languages) while correctly following instructions. I found it to be the perfect ratio of size to quality.

peregrinefalco9@reddit

9B all day. The 35B models are impressive but the hardware requirements put them out of reach for most people running local. A genuinely good 9B that fits in 8GB VRAM would change more workflows than another 35B that needs a 3090.

Daniel_H212@reddit

If the 35B is a sparse MoE, then it's well within reach of anyone with more than 32GB of RAM.

hakanavgin@reddit

Yeah, that is exactly what OP said: it puts them out of reach for most people running local. Most people have 16GB of RAM these days, and with Windows, background apps, and the KV cache, you get no more than 4-6GB for running models. I've got 16GB of VRAM and 16GB of RAM, so I consider myself above average for total amount of fast memory, and I can't run anything bigger than 14B at 32-48k context @ Q4_K_M at usable speeds with comfortable memory usage. Most people overestimate the "average guy" these days.

Daniel_H212@reddit

35B sparse is also still usable with 16 GB of RAM and 8 GB of VRAM, which wouldn't even be unusual for a system built 8 years ago with an RX 580. I don't think it's out of the reach of most people interested in local LLMs.

hakanavgin@reddit

Not really. I may be doing something wrong, but even GLM-4.7-Flash at Q3 is a stretch with all the Windows stuff in the background. I'm dual-booting Arch as well, and the experience is more or less the same. It is either unbearably slow, like sub-10 tok/s, or it outright doesn't load at all. It is possible to load it and somehow use it, but at that point you are not doing anything other than inference, and it is not a realistic scenario. I think realistically at least 48GB of total memory is a must for actually incorporating local LLMs into any workflow.

Anduin1357@reddit

Feels bad when I have more VRAM than people have RAM, and then to have more RAM than people have RAM+VRAM combined. All that to power Qwen3-Next-80B_q6_k_xl at reasonable speeds. I'd consider anyone who can run Qwen3-30B-A3B-Instruct-2507_q4_0 to be average.

TheRealMasonMac@reddit

You don't even need much RAM. It was so shockingly fast inferring from my NVMe drive that I didn't even realize it wasn't using my RAM until I checked.

Far-Low-4705@reddit

Well, if you're looking to purchase a new laptop or PC with this in mind, it's really not that big of a deal to go from 16GB to 32GB. I'd argue 16GB is not enough for Windows anyway; I have 32 and Windows uses 20-24GB in the background. Probably gonna switch to Linux tbh, I only have Windows because I need it for CAD and engineering software.

Tai9ch@reddit

Do you have a gaming machine that you're running some AI on, or have you intentionally built towards running AI models? Because yes, reasonable gaming setups tend to max out at a single 16GB GPU, which makes a 30-35B model kind of crap. But as soon as you're buying hardware to run AI models, options like Strix Halo or 2x 3090s start to make sense, and at that point 35B (or even 80B) becomes entirely feasible.

Thunderstarer@reddit

On my 16GB RX 9060 XT, Qwen3Coder 30B A3B _just barely_ fits at IQ_3_K_XL with 30K context, and I can get a decently useable inference speed of 115T/s. If I throw my 8GB RX 480 in, I can go up to about 64K context, but inference speed drops to about 55T/s, and prompt processing speed gets absolutely murdered.

EnthropicBeing@reddit

Are you implementing ~9B models in any workflow nowadays? I'm genuinely interested since I'm a total amateur and couldn't find any use for them.

andy2na@reddit

I keep qwen3-vl:4b iq4 in vram for Frigate image analyzing, home assistant voice assistant, karakeep, open-notebook, and general questions and it works great. For more complicated tasks like Sure Finance to analyze my finances, I'll temporarily load in qwen3-vl:8b-instruct-q4_K_M. Looking forward to 3.5:9b to compare

EnthropicBeing@reddit

That's awesome. Frigate's default AI engine was more than enough for me. I wonder how your setup works.

andy2na@reddit

Connecting an LLM to Frigate will generate AI descriptions for events. Also, in the new 0.17 versions, it'll generate AI summaries for notifications. It takes Frigate to the next level; highly recommended.

nakedspirax@reddit

The 3090 was released 6 years ago. Maybe it's time to get with the times.

IrisColt@reddit

HEH!

datfalloutboi@reddit

Low cost and good vram. Not much else you can ask for. The 4090 is another option but those are hard to find for decent prices.

jonydevidson@reddit

No idea why you're getting downvotes. The 5090 is not out of stock, it's just absurdly priced, but you gotta pay to play and everyone wants to play this AI game.

peregrinefalco9@reddit

The 5090 has been out of stock since launch. Most people are still running 3090s or less - a strong 9B model helps them today, not in theory.

nakedspirax@reddit

Get a 3090 like your original comment. It's almost 6 years old now. 4 more years and it's a decade old GPU.

AppealThink1733@reddit

And what about qwen 3.5 4B?

JumpyAbies@reddit

What about qwen 3.5 0.6B?

swagonflyyyy@reddit

Honestly that would open the way for end user on-device NPCs.

KaosNutz@reddit

I'm waiting on this one as well. Qwen3 4B is good enough for web search in Open WebUI; I just need to set up Playwright, as fetch_url can't open some websites.

AppealThink1733@reddit

I didn't like it for web browsing. The best 4B I found for that purpose was the ZwZ 4B. It's excellent for that.

KaosNutz@reddit

I think I'm using the instruct version from the Ollama library; the base model wasn't so nice. Will check that one out.

hum_ma@reddit

And qwen 3.5 1.7B

TeamCaspy@reddit

What about second breakfast?

Amazing_Athlete_2265@reddit

What about first breakfast?

dances_with_gnomes@reddit

I might be able to run 9B. No way I can run 35B.

sloth_cowboy@reddit

Specs? I don't have any input, just curious.

dances_with_gnomes@reddit

GeForce GTX 1660 Ti with 6GB of VRAM and 16GB of RAM. Ryzen 7 2700X, if that matters. I honestly don't know much about these!

Daniel_H212@reddit

It's doable if you use IQ2_XXS 😂

Neither-Phone-7264@reddit

I run 30B (granted, MoE) models on an RTX 3060 laptop, which is 6GB + 24GB, so it's definitely doable with lower quants.

AlwaysLateToThaParty@reddit

Are you happy enough with the outputs?

Neither-Phone-7264@reddit

Well, they're no Opus 4.6, but they're perfectly usable, especially for smaller applications like my own deep search agent.

Significant_Fig_7581@reddit

I think you could partially offload it to your RAM, use an IQ3_XXS quant, and enjoy the luxury of a 35B model.

Initial-Argument2523@reddit

Hope we get a new 4B dense for this reason

Straight_Abrocoma321@reddit

Same

NoobMLDude@reddit

Waiting for Qwen3.5-Coder

jacek2023@reddit (OP)

how is this different than Qwen Next Coder? what size do you expect?

NoobMLDude@reddit

Would be nice to have 30B MOE size.

NoobMLDude@reddit

Would be nice to have 30B MOE size or smaller.

ribsdug@reddit

9B is all my poor hardware can run at the moment. 35B is a dream 😭

InvDeath@reddit

why 9b?

ManufacturerWeird161@reddit

Waiting for the 35B, my 3090 is ready to push batch size to its absolute limit for that model size.

Hanselltc@reddit

Honestly both. The big 3.5 has vision, hopefully the small 3.5's also have it. Also looking forward to whatever Gemma is cooking, I really liked the 12B.

TokenRingAI@reddit

140B-A15B

EbbNorth7735@reddit

Would love this. Playing with Qwen3.5 397B and it's surprisingly fast. However, A15B is a bit high for the sparse MoEs they've been making. Maybe an A12B or even A8B/A9B.

ttkciar@reddit

I am curious: Why 140B specifically? Is there a GPU configuration for which 140B is optimal use of VRAM?

TokenRingAI@reddit

RTX 6000, model would be ~ 70-80GB at FP4

jacek2023@reddit (OP)

that would be awesome

kasinjsh@reddit

xx-A3B-xx

Conscious_School6035@reddit

As someone running local LLMs daily, I'd take a well-optimized 9B over a demanding 35B any day. Accessibility matters more than raw power for most users!

Turkino@reddit

35B, since it's not often that we get anything in the 70B range these days.

SpicyWangz@reddit

Next was 80b. That’s pretty close to 70

charles25565@reddit

1.5B-ish and 0.5B-ish :D

silenceimpaired@reddit

Neither. 100-200b. And they won’t be coming.

zipzag@reddit

They do need an MOE between 80B and the older 235B.

jacek2023@reddit (OP)

well, I wish the 80B would get updated

Ylsid@reddit

Whichever runs on my 3090

CarlCarlton@reddit

goekdeniz-guelmez_josiefied-qwen3.5-9b-abliterated-v1

swagonflyyyy@reddit

35b

DeepOrangeSky@reddit

I wonder if maybe Qwen3.5 35b accidentally got eaten by hippos. Maybe if it still doesn't get released in the next day or two, meaning we can be pretty sure that is what happened, we can all hold a candlelight vigil in remembrance of what a nice, wonderful local AI model it could have been, if it hadn't met such a tragic and untimely demise. Maybe people can come up with some poems or song lyrics that we can quietly chant when we hold our candlelight vigils in memory of Qwen3.5 35b. If it turns out that its slightly mentally challenged brother, Qwen3.5 9b also got eaten, then we can hold vigils for that as well, although that would be so tragic that we should not speak of such possibilities for now. Most likely it is just playing on the rainbow farm where your pet dog went on a super long vacation and you never saw it again when it got old. So, once it finds its way back from the rainbow farm, all will be well.

kwinz@reddit

why no 120B ?

RayHell666@reddit

Me waiting for their next vision model...

ciprianveg@reddit

A 235-300B VL model would fit perfectly on my 8x3090 setup... the 397B one forces me to buy more GPUs...

Its_Powerful_Bonus@reddit

Love that amount of VRAM. Qwen3.5 works like a charm with Unsloth IQ3_XXS and KV cache quantization set to Q8. Even RoPE scaling to 512k worked in KoboldCpp. I'm running 2x RTX 6000 Pro.

Its_Powerful_Bonus@reddit

100b-200b a10b multimodal with 1M context which is memory efficient. Waiting for Nemotron 3 Super 100b a10b, but hope that other teams will also go this way

DeepOrangeSky@reddit

Some more dense models between 30b and 120b would be awesome. If they decide to skip the medium sized dense models this time around (which would be a huge shame, but wouldn't surprise me, given how things have been trending), then some not-so-sparse MoE like a 100b a10b or 70b a8b or something might be interesting (not sure if it would do what I think it could do, or if it would be a bad idea, but, I dunno, maybe it would be awesome, lol)

Its_Powerful_Bonus@reddit

Dense models are not power-efficient, and long context costs a lot. Everything that is important for larger-scale deployments is hard to get with dense models.

ttkciar@reddit

Yep, a dense model in the 12B-to-14B range would be great for folks with 16GB VRAM, and a dense model in the 24B-to-32B range would be great for 32GB VRAM.

toothpastespiders@reddit

> which would be a huge shame, but wouldn't surprise me, given how things have been trending

Yeah, I think even more than wanting to actually use a new mid-sized dense model from Qwen, I'd like to see it simply as a suggestion that the industry as a whole hasn't dropped them for MoEs.

singhapura@reddit

Nothing stops you making your own.

ttkciar@reddit

Nothing, except costing more $$$ than a luxury sedan.

beedunc@reddit

Yes.

selnatic@reddit

35B, and it's not even close. 9B is cute for quick local tinkering, but it hits that "sounds confident, misses the point" wall the second you want real reasoning or tool use. A good 35B quant on a 24GB card (or split across CPU/GPU with llama.cpp) is where it starts feeling like an actual assistant instead of autocomplete. The people hyping 9B are mostly just flexing that it runs on a laptop.

ALittleBitEver@reddit

Waiting for Qwen 3.5 4B 💀

alexp702@reddit

A draft model for 397b!

Far-Low-4705@reddit

Qwen 3.5 80b …Hopefully with vision

venkada_321@reddit

0.6b less goooo. Mobile users

wanderer_4004@reddit

I am waiting for a flexi model that automatically adjusts from 8B to 80B and from A1B to A10B, and also switches between thinking and non-thinking depending on the task at hand, the available memory, and the available hardware. I.e., given a simple task it behaves like an 8B-A1B model, and given a difficult task it behaves like an 80B-A10B with thinking. In the latter case it would use itself in 8B-A1B mode for speculative decoding.

m_mukhtar@reddit

35B for sure. I wish they'd create one with a bit more active parameters, something like 70B with A5B, as I think the active part affects intelligence more, while the total parameter count affects knowledge more (not clear black and white for sure, but a general observation).

toothpastespiders@reddit

> not a a clear black and white for sure but a general observation

So far the only mid-size MoE that doesn't have that idiot-savant feel to me is Air, with 106B total and 12B active.

xandep@reddit

Probably tomorrow. Source: my head. But seriously, Monday is a hot day for model releases.

t_krett@reddit

Please don't say that, I am tempted to wait for the Monday morning sun to rise on China

vovxbroblox@reddit

0.2B, I need to feed my RPi Zero 2 W.

LegacyRemaster@reddit

Since I have Qwen3.5-397B-A17B-UD I can finally stop using non-local LLMs.

joblesspirate@reddit

I finally got it working since they patched the llama.cpp bug. I love it!

Darth_Ender_Ro@reddit

Better Q: what are you using them for?

Adventurous-Paper566@reddit

35B, if it fits at Q6 in 32GB of RAM with a context > 8k.

Initial-Argument2523@reddit

If we get a new 32B dense it could potentially be quite interesting to prune it down to 24B

MerePotato@reddit

Yes

DayshareLP@reddit

20b would be nice

MerePotato@reddit

35B any day, 24GB VRAM is the consumer hardware sweet spot

stoppableDissolution@reddit

None, tbh. Qwen models have been a disappointment since 2.5 and QwQ.

cdshift@reddit

For what use cases? Qwen3-Coder-Next is my daily driver on my local setup with OpenCode.

stoppableDissolution@reddit

I gave up on local coding long ago, so idk about that. But for things like classification/ranking/synthetic data generation/etc., Qwen is kinda sad compared to heavily quantized Mistral Large, GLM Air, or Gemma. The other case is RP, but it never even entered the competition there.

Tai9ch@reddit

Local coding works fine now, starting at about 48GB of VRAM.

stoppableDissolution@reddit

It's sloooooow and significantly worse than Claude. Like, there's nothing private about my code that is being pushed on public GitHub anyway, so why use a worse product?

Tai9ch@reddit

Being slow is an issue with your hardware specs. How well it performs is an issue with model choice and workflow. Sure, you're not going to get the same one-shot performance from an 80B model like Qwen3-Coder-Next that you'd get from a 400B+ model like Claude Opus. But there are certainly several open models in the same broad capability class as the frontier proprietary models, and there's a pretty smooth gradient from even 30B models up to 1T models with all of them being useful for coding.

stoppableDissolution@reddit

Well, sure, big glm is good enough to replace closed models, but nothing that can run in 48gb is remotely useful in my opinion. It takes more effort to make 80b qwen or, idk, 32b glm produce good-enough code than to write it myself.

Tai9ch@reddit

I've had good results from 80B Qwen, and that'll run in 48GB at Q4 even with a reasonable amount of context, and it'll be *fast* on something like a pair of 3090s. Of course bigger is better. And bigger is entirely achievable with a little bit of effort (and a moderate to gargantuan amount of money).

GraybeardTheIrate@reddit

Definitely interested in a 35B, especially if it's dense.

10minOfNamingMyAcc@reddit

Personally, 35B but 9B doesn't sound too bad either.

PANIC_EXCEPTION@reddit

9B because it would be amazing to see it work on my phone. My laptop can already run Qwen-Coder-Next 80B and it works really well for general purpose as well.

Cool-Chemical-5629@reddit

https://i.redd.it/u5jr2o6bi3lg1.gif

SuchAGoodGirlsDaddy@reddit

Honestly a SOTA 9B would be big for me right now. Of course I’ll happily wait for TheDrummer to get ahold of it.

jacek2023@reddit (OP)

Are there any qwen finetunes from u/TheLocalDrummer?

Lesser-than@reddit

9b just because I know it will fit on anything I own, I get excited for just about anything qwen though, as they continue to set a solid groundwork for the future of llms.

Opening-Ad6258@reddit

9b because I can actually run it

DenZNK@reddit

Share how you use it pls. I can't understand why I would need it, since I use cloud services. I have an RTX 5080. What tasks could it be used for besides STT or TTS?

jacek2023@reddit (OP)

This sub is about using LLMs locally, not in the cloud

DenZNK@reddit

That's why I'm asking what it will be used for, in case I need it, since my video card is currently only used for gaming :)

jacek2023@reddit (OP)

that's a really big topic, but basically if you like spending time on gaming and don't want to learn new things then probably local LLM won't be interesting for you

DenZNK@reddit

I haven't been playing much lately; I haven't turned anything on for over a month and spend most of my time vibe coding :)

jacek2023@reddit (OP)

then explore this sub

deathentry@reddit

Will 9B work with 8GB VRAM? I can only fit a 35k context window, which means I can't even use the Angular MCP 🤣 😅

sciencewarrior@reddit

It should be about 5GB with a 4-bit quantization, leaving a couple GB for a decent context size.
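
For the context side of the budget, a standard-attention KV cache stores one K and one V vector per KV head, per attention layer, per token. The function below is a sketch of that formula; the layer count, KV-head count, and head size in the example are hypothetical round numbers, not Qwen3.5-9B's actual configuration. Hybrid models that mix linear-attention layers with full attention only keep a growing cache for the full-attention layers, so their real figure can be much lower.

```python
def kv_cache_gb(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: float = 2) -> float:
    """Approximate KV-cache size in GB (1 GB = 1e9 bytes).

    Factor 2 = one K and one V tensor per full-attention layer, each
    n_kv_heads * head_dim wide, stored for every token in the context.
    bytes_per_elem=2 is an FP16 cache; a q8_0 cache is about 1.06 bytes
    per element (int8 values plus per-block scales).
    """
    bytes_per_token = 2 * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem
    return bytes_per_token * context_tokens / 1e9


if __name__ == "__main__":
    # Hypothetical config for illustration only (not taken from a model card).
    LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128
    ctx = 35_000
    print(f"FP16 cache:   {kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, ctx):.2f} GB")
    print(f"~8-bit cache: {kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, ctx, 1):.2f} GB")
    # Same hypothetical model if only 9 of its 36 layers used full attention.
    print(f"FP16, 9 attention layers: {kv_cache_gb(9, KV_HEADS, HEAD_DIM, ctx):.2f} GB")
```

With these made-up numbers, 35k tokens come to about 5.2GB at FP16 and 2.6GB at 8 bits, which is why KV-cache quantization and hybrid attention matter so much on an 8GB card.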

JumpyAbies@reddit

But he just launched... Soon someone else will be crying about the 7b, 3b, 0.6b, 1m, 1k

Septerium@reddit

Both would be cute toys to play with

-InformalBanana-@reddit

35B only if it is a MoE, otherwise 9B. But for me, an 80B or 90B-A3B would be a good MoE, because I have 96GB of RAM. Or maybe they should try an A4B MoE, because Qwen 4B has good performance for its size, so maybe that would translate well into an MoE. Hopefully that won't slow the model down too much.

Look_0ver_There@reddit

I'll take a 120B one thanks!

Zestyclose-Shift710@reddit

35b if moe

YoussofAl@reddit

4B. You’re all sleeping on 4B 2507. My favourite model.

Ardalok@reddit

Personally, I'm looking forward to Gemma 4 more.

jeekp@reddit

nemotron 3 super nvfp4 on llama.cpp

LushHappyPie@reddit

7B to 12B with Test Time Training. I couldn't care less about 5% stronger reasoning or 7% stronger agentic performance in a local model.

pigeon57434@reddit

35B definitely

johnmacleod99@reddit

9B

NullKalahar@reddit

9b

cruzanstx@reddit

35b

DrNavigat@reddit

As long as they aren't thinking models that waste my hardware with tokens that barely alter the final answer and clutter my context...

ab2377@reddit

there should be a menu, like in the restaurants, "what parameters count will you like to have?", you click 9, "your order will be served in 5 minutes", you click download after 5 minutes.

Conscious_Nobody9571@reddit

9B pls

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! [Come check it out!](https://discord.gg/PgFhZ8cnWW) You've also been given a special flair for your contribution. We appreciate your post! *I am a bot and this action was performed automatically.*

Black-Mack@reddit

Qwen 3.5 1.5B

bene_42069@reddit

9B prolly the only one I can run with my laptop lmao

FantasticProcedure46@reddit

Qwen3.5-VL-9B

ilintar@reddit

3.5 is VL by default.

Alby407@reddit

None of them.

Single_Ring4886@reddit

I'll look like some kind of Qwen fanboy, but I must say that as open-source models go, theirs are the best. Their models feel well balanced, not obsessed with just coding like GLM or Kimi etc. Maybe the new DeepSeek will be good, but then again it will have 700B.

Confident-Aerie-6222@reddit

A good 4B multilingual model that beats gemma models at translation abilities and is also good at logic, thinking and coding.

dampflokfreund@reddit

35B A3B. Probably a lot better than 9B and still fast enough.

Slow_Concentrate3831@reddit

Between 14b and 20b would be cool

JockY@reddit

235B!

somkomomko@reddit

I have a 36GB MacBook; sadly it doesn't fit a 32B for anything useful, and inference is so slow.

jacek2023@reddit (OP)

you should compare to 30B A3B

Zyj@reddit

The 397B works ok

mehhxx@reddit

I am keeping my hopes up for an extensive list of options just like Qwen 3 was, as even a 0.6b reasoning model would come in incredibly handy for very low-end devices and edge cases.
