TheaterFire

Which one are you waiting for more: 9B or 35B?

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 220 comments

220 Comments

Kat-@reddit

70B-3A
View on Reddit #79029765

social_tech_10@reddit

Have you tried Qwen3-Next-80B-A3B or Qwen3-Coder-Next-80B-A3B with linear attention? The whole model fits in 24GB VRAM, and runs at ~50 t/s on my PC, so even if part of it spills over into RAM, it will probably still be fast enough to be very usable.
View on Reddit #79030126

cristoper@reddit

> The whole model fits in 24GB VRAM

How do you fit an 80b model in 24GB? What quant do you run?
View on Reddit #79040201

Xantrk@reddit

I'm also running Unsloth IQ3S quant on 12gb VRAM + 32 gb RAM combo with 30tk/s. MOE models are wild for us GPU poors
View on Reddit #79140077

huseynli@reddit

I am new to local llm. Started playing with it today. Haven't figured out fine tuning and intricacies (thinking vs non-thinking, A3B being 3B active despite 35b total and stuff). My current environment is Llama.cpp + openwebui. AMD 7700XT (12gb vram) + 32gb ddr5 ram. What would you say are the best models I should try, experiment with? Qwen 3.5 9b, qwen3.5 35 a3b, glm, open oss, etc? Sorry for info requesting like this. I just saw that we have similar hardware. Thank you.
View on Reddit #80077514

social_tech_10@reddit

Unsloth [MXFP4_MOE](https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/resolve/main/Qwen3-Coder-Next-MXFP4_MOE.gguf?download=true)
View on Reddit #79079446

cristoper@reddit

Thanks. I'm going to try an mxfp4 quant and see if it works better than the q4_k_s from unsloth. But it's still a 4-bit quantization so it will require 40GB just for the weights alone... definitely won't fit in 24GB vram. I'm impressed you're getting 50 t/s on your pc. What is your hardware?
View on Reddit #79081451
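
The 40GB figure above is just parameter count times bits per weight. A minimal sketch of that arithmetic, assuming weights only (no KV cache, no runtime overhead) and 1 GB = 1e9 bytes; the function names are made up for illustration and are not part of llama.cpp or any other tool:

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    # billions of params * bits per weight / 8 bits per byte = billions of bytes
    return params_billion * bits_per_weight / 8


def max_bits_per_weight(params_billion: float, budget_gb: float) -> float:
    """Average bits per weight that would fit a given memory budget (weights only)."""
    return budget_gb * 8 / params_billion


if __name__ == "__main__":
    print(weight_size_gb(80, 4))        # 80B model at a flat 4 bits per weight
    print(max_bits_per_weight(80, 24))  # average bits needed to fit 80B in 24 GB
```

By this estimate an 80B model only fits entirely in 24GB at well under 3 bits per weight on average, so a 4-bit quant on a 24GB card implies some of the weights sit in system RAM. Real quant formats mix bit widths per tensor, so actual file sizes will differ somewhat.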

agoldin@reddit

I have no idea what hardware he has, but launch claude code, codex , or qwen with free plan in your llama.cpp directory, tell it to read information from this discussion (just give it URL) and unsloth page for the model (like [https://unsloth.ai/docs/models/qwen3-coder-next](https://unsloth.ai/docs/models/qwen3-coder-next) ), allow it to search on web and discuss how to best configure and compile your version of llama.cpp and later run ./build/llama-bench on your particular hardware. Just use llms via API to bootstrap your local llm.
View on Reddit #79138447

techmago@reddit

> How do you fit an 80b model in 24GB? what quant do you run?
View on Reddit #79041075

Firm_Meeting6350@reddit

Qwen3-Next-80B-A3B is amazing - still I wonder how a potential Qwen3.5-Next-80B would perform :D
View on Reddit #79030695

Far-Low-4705@reddit

70b?? Why not 80b
View on Reddit #79038667

Kat-@reddit

Oh, yeah, that too.
View on Reddit #79039132

-InformalBanana-@reddit

Hahaha
View on Reddit #79045058

Significant_Fig_7581@reddit

Honestly both! A 60B model would also be 🔥🔥🔥
View on Reddit #79029613

Iory1998@reddit

Why not 80B? It works well on many consumer hardware.
View on Reddit #79041548

Fresh_Finance9065@reddit

Something for the 8gb vram + 32gb ram systems would be nice. 60B is that nice size
View on Reddit #79061294

Beautiful_Egg6188@reddit

HOW!!! I'm struggling with 12GB VRAM + 64GB RAM to run 27b q4\_k\_m with 32k context. It's slow at 5 t/s and keeps getting slower over time
View on Reddit #79756670

Fresh_Finance9065@reddit

Use MoE models around 70B parameters and offload weights to cpu. I'm assuming you are using Qwen3.5 27B, which is dense. Dense will start slow and slow down faster.
View on Reddit #79759841
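
A back-of-envelope way to see why a dense model is so much slower than a sparse MoE when weights live in system RAM: token generation is often limited by how many bytes of weights have to be read per token. The sketch below is a rough upper bound under stated assumptions, not a benchmark. It assumes every active weight is read once per token from a single memory pool, ignores KV cache reads, compute and any GPU offload, and the 60 GB/s bandwidth and 4.5 bits per weight are hypothetical values picked for illustration.

```python
def bytes_read_per_token_gb(active_params_billion: float, bits_per_weight: float) -> float:
    """GB of weights touched per generated token (active parameters only)."""
    return active_params_billion * bits_per_weight / 8


def decode_speed_upper_bound(active_params_billion: float,
                             bits_per_weight: float,
                             bandwidth_gb_per_s: float) -> float:
    """Rough ceiling on tokens/s if decoding is purely memory-bandwidth bound."""
    return bandwidth_gb_per_s / bytes_read_per_token_gb(active_params_billion, bits_per_weight)


if __name__ == "__main__":
    RAM_BANDWIDTH = 60.0  # GB/s, hypothetical system RAM figure
    BITS = 4.5            # hypothetical average for a ~4-bit quant

    dense_27b = decode_speed_upper_bound(27, BITS, RAM_BANDWIDTH)  # all 27B active
    moe_a3b = decode_speed_upper_bound(3, BITS, RAM_BANDWIDTH)     # ~3B active per token
    print(f"dense 27B: ~{dense_27b:.1f} t/s, MoE with 3B active: ~{moe_a3b:.1f} t/s")
```

With these made-up numbers the dense 27B tops out around 4 t/s while a model with 3B active parameters tops out around 36 t/s, which is the same ballpark as the speeds people report in this thread. Total parameter count still decides whether the model fits in memory at all.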

Significant_Fig_7581@reddit

Oh I use the Qwen3 Coder Next all the time, But I don't think they'll give us another 80B in the same month...
View on Reddit #79042026

Iory1998@reddit

They might. The initial Qwen-3 Next was undertrained. I think a properly trained Next model will be a beast.
View on Reddit #79043105

Significant_Fig_7581@reddit

If you mean the Qwen Next instruct from last year, I think this one is already an update for that one. I still hope they get us another one, especially something around the 60B size, because I really want people with 16GB of RAM and 16GB of VRAM to experience how good Qwen can be even in low quants, though I'm certain the 35B would be great too...
View on Reddit #79043533

Iory1998@reddit

60B is also a good size, no doubt about it. But I think they might release 2 dense models first: a 9B and a 32B. I think with the hybrid attention, a Q4 of the 32B can fit in 24GB VRAM with 127K context size.
View on Reddit #79047457

Far-Low-4705@reddit

There are 2 confirmed models: 9b dense, and 35b MOE. There might be a 2b dense too but I forget. I’m really hoping for an 80b tho
View on Reddit #79066008

Iory1998@reddit

35B MoE sounds like a 30BA3B with 5B vision encoder! I am hoping for an 80B too.
View on Reddit #79068357

Far-Low-4705@reddit

yeah that's what i was thinking, though 5b sounds very big for a vision encoder lol. I think this is the first natively multimodal model (and probably one of the first big name local models) so i'm hoping the vision is crazy good! Tbh, qwen 3vl already has basically perfect vision (considering their size) imo, so i'm super excited for 3.5 since i use these models for engineering lol
View on Reddit #79076096

Iory1998@reddit

I absolutely agree regarding the capabilities of Qwen-3-VL. No local model comes close to their performance.
View on Reddit #79079247

AlwaysLateToThaParty@reddit

Yeah, as far as vision models go, it's my go-to.
View on Reddit #79094404

Far-Low-4705@reddit

Ppl with 16Gb of ram are very unlikely to have 16Gb of VRAM.. also u wouldn’t be able to fit any context. More likely 16Gb RAM with 6-8Gb VRAM, maybe 12. For those 8b dense or 30b MOE is best. I think 32Gb RAM and 16Gb VRAM is much more common, and where 80b is best.
View on Reddit #79065916

finah1995@reddit

I have 16 GB with about 6 GB VRAM, expecting smaller models.
View on Reddit #79085690

Significant_Fig_7581@reddit

I disagree I've seen a lot of people with only 32gb capacity when combining their ram and vram...
View on Reddit #79066582

Far-Low-4705@reddit

I think it’s more likely for an 80b model. But I’m reeeeally hoping we do get an 80b
View on Reddit #79038625

hesperaux@reddit

Same. I'm literally checking like 4 times a day. Finally being able to do coding and vision with one model is gonna unlock one of my gpus for other tasks. Qwen3.5-coder-80b would make me moist.
View on Reddit #79169597

Far-Low-4705@reddit

they just released qwen 3.5 122b a10b on qwen chat, so i doubt there will be an 80b sadly, but no weights yet.
View on Reddit #79195152

hesperaux@reddit

Yeah. 😞 They are on Hugging Face now too. I guess I'll try to make do with 122b at q3. 😬 I really don't like using less than q4_k_m and I was hoping to do q6. Maybe they will release an 80b down the line similar to Next.
View on Reddit #79209733

Far-Low-4705@reddit

yeah, that'd be super nice, but i doubt it.. dang that sucks, 80b would have been IDEAL for me. but those extra 42b parameters are just too much. Gonna be hard to replace the qwen 3 next models for now ig.
View on Reddit #79213720

Significant_Fig_7581@reddit

But didn't we just get an 80B? But you're right I also wanna see another one lol
View on Reddit #79038863

Far-Low-4705@reddit

80b is a really great size, you can run it at full context with only 48Gb. That’s basically a standard mid tier PC. (16Gb GPU & 32Gb VRAM)
View on Reddit #79040035

journalofassociation@reddit

Somehow I'm able to run it in 36GB VRAM with 48k context (q3). Not sure how it runs so efficiently.
View on Reddit #79057350

shortfinal@reddit

16gb gpu with 32gb vram? What?
View on Reddit #79040587

Far-Low-4705@reddit

ah, sorry, typo, meant 16Gb VRAM with 32Gb RAM
View on Reddit #79048432

IMightBeAlpharius@reddit

I know what you meant :D
View on Reddit #79052607

jinnyjuice@reddit

Anything that fits in 100GB memory including 100k+ tokens!
View on Reddit #79031507

insmek@reddit

Give me the best model that I can use on my 128GB MacBook or give me death.
View on Reddit #79057588

Funny_Working_7490@reddit

Anything for m4 air 256 gb?
View on Reddit #79111949

Daniel_H212@reddit

Minimax M2.5 must work quite well on that system.
View on Reddit #79120683

MoffKalast@reddit

One death, coming right up.
View on Reddit #79097684

zipzag@reddit

The Qwen Next 80Bs, which include Qwen Coder Next, are already here. 40GB to 80GB
View on Reddit #79080395

megacewl@reddit

Best we can do is Q4, 3B
View on Reddit #79079794

CriticismNo3570@reddit

Waiting for R2, but don't expect that alone to affect the NASDAQ much
View on Reddit #79455656

Open-Raise-6676@reddit

For MoE models, I hope they have more total parameters but fewer activated parameters
View on Reddit #79328644

vhthc@reddit

I hope they do a 32b dense again
View on Reddit #79034252

ttkciar@reddit

Me too! But maybe someone will distill into Qwen3-32B?
View on Reddit #79068824

vhthc@reddit

They released a 27b with impressive scores
View on Reddit #79286297

ttkciar@reddit

Yup :-) I just downloaded it last night! It is very pleasing so far.
View on Reddit #79296921

Traditional-Card6096@reddit

Would love to see a 9B run smoothly on iphone
View on Reddit #79275738

Material-Ad5426@reddit

Very curious for anything that can work on a bit more standard office laptop gpu 🙏
View on Reddit #79208774

Tai9ch@reddit

I can't wait for the 85B. Right now I'm running both 30B-VL and 80B-Coder, and it'd be nicer if I could just run the big model for both.
View on Reddit #79048049

hesperaux@reddit

Exactly
View on Reddit #79169809

MinusKarma01@reddit

Honestly, a 1.7B one. It is crazy how good the Qwen3 version of it is. Can be run on CPU if you need cheap and can be run on GPU if you need fast. Can correctly summarize, extract and classify multilingual texts (EU languages) while correctly following instructions. I found it to be the perfect ratio of size to quality.
View on Reddit #79168668

peregrinefalco9@reddit

9B all day. The 35B models are impressive but the hardware requirements put them out of reach for most people running local. A genuinely good 9B that fits in 8GB VRAM would change more workflows than another 35B that needs a 3090.
View on Reddit #79031707

Daniel_H212@reddit

if the 35B is a sparse MoE then its well within reach of anyone with more than 32 GB of RAM.
View on Reddit #79035218

hakanavgin@reddit

Yeah that is exactly what op said, it puts them out of reach for most people running local. Most people use 16 GB RAM these days, even then with Windows, background apps and kv cache, you get no more than 4-6 gigabytes for running models. I've got 16 gigs of VRAM and 16 gigs of RAM so I consider myself above average for total amount of fast memory, and can't run anything more than 14B at 32-48k ctx@q4_k_m at any usable speeds and comfortable memory usage. Most people overestimate the "average guy" these days.
View on Reddit #79051718

Daniel_H212@reddit

35B sparse is also still usable with 16 GB of RAM and 8 GB of VRAM, which wouldn't even be unusual for a system built 8 years ago with an RX 580. I don't think it's out of the reach of most people interested in local LLMs.
View on Reddit #79056889

hakanavgin@reddit

Not really, I may be doing something wrong but even GLM4.7 Flash at q3 is a stretch with all the windows stuff in the background. I'm dual booting Arch as well and the experience is more or less the same. It is either unbearably slow, like sub 10 tok/s or outright doesn't load at all. It is possible to load and somehow use, but at this point you are not doing anything other than inference and it is not a realistic scenario. I think realistically at least 48 gigs of total memory is a must for actually incorporating local llms to any workflow.
View on Reddit #79137717

Anduin1357@reddit

Feels bad when I have more VRAM than people have RAM, and then to have more RAM than people have RAM+VRAM combined. All that to power Qwen3-Next-80B_q6_k_xl at reasonable speeds. I'd consider anyone who can run Qwen3-30B-A3B-Instruct-2507_q4_0 to be average.
View on Reddit #79095370

TheRealMasonMac@reddit

You don't even need much RAM. It was shockingly fast inferring from my NVME drive that I didn't even realize it wasn't using my RAM until I checked.
View on Reddit #79085109

Far-Low-4705@reddit

well, if you're looking to purchase a new laptop or PC with this in mind, it's really not that big of a deal to go from 16Gb to 32Gb. I'd argue 16Gb is not enough for windows anyway, i have 32 and windows uses 20-24Gb in the background. Probably gonna switch to linux tbh, i only have windows cuz i need it for CAD and engineering software
View on Reddit #79080666

Tai9ch@reddit

Do you have a gaming machine that you're running some AI on, or have you intentionally built towards running AI models? Because yes, reasonable gaming setups tend to max out at a single 16GB GPU, which makes a 30-35B model kind of crap. But as soon as you're buying hardware to run AI models, options like Strix Halo or 2x 3090's start to be entirely feasible, and at that point 35B (or even 80B) becomes entirely feasible.
View on Reddit #79047737

Thunderstarer@reddit

On my 16GB RX 9060 XT, Qwen3Coder 30B A3B _just barely_ fits at IQ_3_K_XL with 30K context, and I can get a decently useable inference speed of 115T/s. If I throw my 8GB RX 480 in, I can go up to about 64K context, but inference speed drops to about 55T/s, and prompt processing speed gets absolutely murdered.
View on Reddit #79081365

EnthropicBeing@reddit

Are you implementing ~9B models in any workflow nowadays? I'm genuinely interested since I'm a total amateur and couldn't find any use for them.
View on Reddit #79034980

andy2na@reddit

I keep qwen3-vl:4b iq4 in vram for Frigate image analyzing, home assistant voice assistant, karakeep, open-notebook, and general questions and it works great. For more complicated tasks like Sure Finance to analyze my finances, I'll temporarily load in qwen3-vl:8b-instruct-q4_K_M. Looking forward to 3.5:9b to compare
View on Reddit #79060213

EnthropicBeing@reddit

That's awesome. Frigate default AI engine was more than enough for me. I wonder how your setup worked
View on Reddit #79061468

andy2na@reddit

Connecting a llm to frigate will generate AI descriptions for events. Also in the new 0.17 versions, it'll generate AI summaries for notifications. It elevates frigate to the next level and highly recommended
View on Reddit #79063303

nakedspirax@reddit

The 3090 was released 6 years ago. Maybe it's time to get with the times.
View on Reddit #79032672

IrisColt@reddit

HEH!
View on Reddit #79033943

datfalloutboi@reddit

Low cost and good vram. Not much else you can ask for. The 4090 is another option but those are hard to find for decent prices.
View on Reddit #79033791

jonydevidson@reddit

No idea why you're getting down votes. 5090 is not out of stock, it's just absurdly priced but you gotta pay to play and everyone wants to play this AI game.
View on Reddit #79033730

peregrinefalco9@reddit

The 5090 has been out of stock since launch. Most people are still running 3090s or less - a strong 9B model helps them today, not in theory.
View on Reddit #79032856

nakedspirax@reddit

Get a 3090 like your original comment. It's almost 6 years old now. 4 more years and it's a decade old GPU.
View on Reddit #79033071

AppealThink1733@reddit

And what about qwen 3.5 4B?
View on Reddit #79030495

JumpyAbies@reddit

What about qwen 3.5 0.6B?
View on Reddit #79046446

swagonflyyyy@reddit

Honestly that would open the way for end user on-device NPCs.
View on Reddit #79121747

KaosNutz@reddit

im waiting on this one as well, qwen 3 4b is good enough for web search on open-webui, I just need to setup playwright as fetch\_url can't open some websites
View on Reddit #79071868

AppealThink1733@reddit

I didn't like it for web browsing. The best 4B I found for that purpose was the ZwZ 4B. It's excellent for that.
View on Reddit #79071967

KaosNutz@reddit

I think I'm using the instruct version from the ollama library, base model wasn't so nice. will check that one out
View on Reddit #79072481

hum_ma@reddit

And qwen 3.5 1.7B
View on Reddit #79056503

TeamCaspy@reddit

What about second breakfast?
View on Reddit #79035308

Amazing_Athlete_2265@reddit

What about first breakfast?
View on Reddit #79052714

dances_with_gnomes@reddit

I might be able to run 9B. No way I can run 35B.
View on Reddit #79030571

sloth_cowboy@reddit

Specs? I don't have any input, just curious.
View on Reddit #79048972

dances_with_gnomes@reddit

GeForce GTX 1660 Ti with 6 gb vram and 16 gb of RAM. Ryzen 7 2700X if that matters, I honestly don't know much about these!
View on Reddit #79053435

Daniel_H212@reddit

It's doable if you use IQ2\_XXS 😂
View on Reddit #79120801

Neither-Phone-7264@reddit

i run 30b (granted moe) models on a rtx 3060 laptop which is 6gb + 24gb so its definitely doable with lower quants
View on Reddit #79055633

AlwaysLateToThaParty@reddit

Are you happy enough with the outputs?
View on Reddit #79094654

Neither-Phone-7264@reddit

well they're no opus 4.6 but they're perfectly usable especially for smaller applications like my own deep search agent
View on Reddit #79107591

Significant_Fig_7581@reddit

I think you could offload it partially from your ram, use an IQ3 XXS quant and enjoy the luxury of a 35B model
View on Reddit #79055219

Initial-Argument2523@reddit

Hope we get a new 4B dense for this reason
View on Reddit #79059122

Straight_Abrocoma321@reddit

Same
View on Reddit #79036856

NoobMLDude@reddit

Waiting for Qwen3.5-Coder
View on Reddit #79092138

jacek2023@reddit (OP)

how is this different than Qwen Next Coder? what size do you expect?
View on Reddit #79092244

NoobMLDude@reddit

Would be nice to have 30B MOE size.
View on Reddit #79104436

NoobMLDude@reddit

Would be nice to have 30B MOE size or smaller.
View on Reddit #79104380

ribsdug@reddit

9B is all my poor hardware can run at the moment. 35B is a dream 😭
View on Reddit #79101750

InvDeath@reddit

why 9b?
View on Reddit #79095720

ManufacturerWeird161@reddit

Waiting for the 35B, my 3090 is ready to push batch size to its absolute limit for that model size.
View on Reddit #79095414

Hanselltc@reddit

Honestly both. The big 3.5 has vision, hopefully the small 3.5's also have it. Also looking forward to whatever Gemma is cooking, I really liked the 12B.
View on Reddit #79089336

TokenRingAI@reddit

140B-A15B
View on Reddit #79039429

EbbNorth7735@reddit

Would love this. Playing with Qwen3.5 397B and it's surprisingly fast. However A15B is a bit high for the sparse MoEs they've been making. Maybe an A12B or even A8B/A9B.
View on Reddit #79088079

ttkciar@reddit

I am curious: Why 140B specifically? Is there a GPU configuration for which 140B is optimal use of VRAM?
View on Reddit #79068753

TokenRingAI@reddit

RTX 6000, model would be ~ 70-80GB at FP4
View on Reddit #79072091

jacek2023@reddit (OP)

that would be awesome
View on Reddit #79039581

kasinjsh@reddit

xx-A3B-xx
View on Reddit #79086520

Conscious_School6035@reddit

As someone running local LLMs daily, I'd take a well-optimized 9B over a demanding 35B any day. Accessibility matters more than raw power for most users!
View on Reddit #79082525

Turkino@reddit

35b, since we don't often get anything in the 70b range these days.
View on Reddit #79038636

SpicyWangz@reddit

Next was 80b. That’s pretty close to 70
View on Reddit #79081377

charles25565@reddit

1.5B-ish and 0.5B-ish :D
View on Reddit #79081207

silenceimpaired@reddit

Neither. 100-200b. And they won’t be coming.
View on Reddit #79032911

zipzag@reddit

They do need an MOE between 80B and the older 235B.
View on Reddit #79080579

jacek2023@reddit (OP)

well, I wish the 80B would get updated
View on Reddit #79032961

Ylsid@reddit

Whichever runs on my 3090
View on Reddit #79080313

CarlCarlton@reddit

goekdeniz-guelmez\_josiefied-qwen3.5-9b-abliterated-v1
View on Reddit #79080049

swagonflyyyy@reddit

35b
View on Reddit #79079958

DeepOrangeSky@reddit

I wonder if maybe Qwen3.5 35b accidentally got eaten by hippos. Maybe if it still doesn't get released in the next day or two, meaning we can be pretty sure that is what happened, we can all hold a candlelight vigil in remembrance of what a nice, wonderful local AI model it could have been, if it hadn't met such a tragic and untimely demise. Maybe people can come up with some poems or song lyrics that we can quietly chant when we hold our candlelight vigils in memory of Qwen3.5 35b. If it turns out that its slightly mentally challenged brother, Qwen3.5 9b also got eaten, then we can hold vigils for that as well, although that would be so tragic that we should not speak of such possibilities for now. Most likely it is just playing on the rainbow farm where your pet dog went on a super long vacation and you never saw it again when it got old. So, once it finds its way back from the rainbow farm, all will be well.
View on Reddit #79077389

kwinz@reddit

why no 120B ?
View on Reddit #79069969

RayHell666@reddit

Me waiting for their next vision model...
View on Reddit #79069542

ciprianveg@reddit

a 235-300b with VL model will fit perfectly on my 8x3090 setup.. the 398b one forces me to buy more gpus..
View on Reddit #79061383

Its_Powerful_Bonus@reddit

Love that amount of vram. Qwen3.5 works like a charm with unsloth iq3_xxs and context quantization set to q8. Even RoPE for 512k worked in koboldcpp. I'm running 2x rtx 6000 pro.
View on Reddit #79069251

Its_Powerful_Bonus@reddit

100b-200b a10b multimodal with 1M context which is memory efficient. Waiting for Nemotron 3 Super 100b a10b, but hope that other teams will also go this way
View on Reddit #79068880

DeepOrangeSky@reddit

Some more dense models between 30b and 120b would be awesome. If they decide to skip the medium sized dense models this time around (which would be a huge shame, but wouldn't surprise me, given how things have been trending), then some not-so-sparse MoE like a 100b a10b or 70b a8b or something might be interesting (not sure if it would do what I think it could do, or if it would be a bad idea, but, I dunno, maybe it would be awesome, lol)
View on Reddit #79050517

Its_Powerful_Bonus@reddit

Dense models are not power efficient, and long context costs a lot. Everything that is important for larger scale deployments is hard to get with dense models.
View on Reddit #79068720

ttkciar@reddit

Yep, a dense model in the 12B-to-14B range would be great for folks with 16GB VRAM, and a dense model in the 24B-to-32B range would be great for 32GB VRAM.
View on Reddit #79068549

toothpastespiders@reddit

> which would be a huge shame, but wouldn't surprise me, given how things have been trending

Yeah, I think even more than wanting to actually use a new mid-sized dense model from Qwen I'd like to see it simply as a suggestion that the industry as a whole hasn't dropped them for MoEs.
View on Reddit #79064571

singhapura@reddit

Nothing stops you making your own.
View on Reddit #79050682

ttkciar@reddit

Nothing, except costing more $$$ than a luxury sedan.
View on Reddit #79068354

beedunc@reddit

Yes.
View on Reddit #79068272

selnatic@reddit

35B, and its not even close. 9B is cute for quick local tinkering, but it hits that "sounds confident, misses the point" wall the second you want real reasoning or tool use. A good 35B quant on a 24GB card (or split across CPU/GPU with llama.cpp) is where it starts feeling like an actual assistant instead of autocomplete. The people hyping 9B are mostly just flexing that it runs on a laptop.
View on Reddit #79068111

ALittleBitEver@reddit

Waiting for Qwen 3.5 4B 💀
View on Reddit #79066233

alexp702@reddit

A draft model for 397b!
View on Reddit #79065923

Far-Low-4705@reddit

Qwen 3.5 80b …Hopefully with vision
View on Reddit #79065650

venkada_321@reddit

0.6b less goooo. Mobile users
View on Reddit #79065635

wanderer_4004@reddit

I am waiting for a flexi model that automatically adjusts from 8-80B and from A1B to A10B and also switching between thinking and non-thinking depending on the task at hand, the available memory and the available hardware. I.e. given a simple task it behaves like a 8B1 model, and given a difficult task it behaves like 80B A10B with thinking. In the latter case it will use itself in 8B1 for speculative decoding.
View on Reddit #79065083

m_mukhtar@reddit

35b for sure. I wish they'd create one with a bit more active parameters. Something like 70b with A5b, as I think the active part affects intelligence more than the total parameters, which affect knowledge more (not a clear black and white for sure but a general observation)
View on Reddit #79034058

toothpastespiders@reddit

> not a clear black and white for sure but a general observation

So far the only mid-size MoE that doesn't have that idiot savant feel to me is Air with 106b 12a.
View on Reddit #79064991

xandep@reddit

Probably tomorrow. Source: my head. But seriously, Monday is a hot day for model releases. 
View on Reddit #79032865

t_krett@reddit

Please don't say that, I am tempted to wait for the Monday morning sun to rise on China
View on Reddit #79063335

vovxbroblox@reddit

0.2b, i need to feed my rpi 2w zero.
View on Reddit #79061210

LegacyRemaster@reddit

Since I have Qwen3.5-397B-A17B-UD I can finally stop using non-local LLMs.
View on Reddit #79030545

joblesspirate@reddit

I finally got it working since they patched the llama.cpp bug. I love it!
View on Reddit #79061026

Darth_Ender_Ro@reddit

Better Q: what are you using them for?
View on Reddit #79060798

Adventurous-Paper566@reddit

35B, if it fits at Q6 in 32GB of RAM with a context > 8k
View on Reddit #79032169

Initial-Argument2523@reddit

If we get a new 32B dense it could potentially be quite interesting to prune it down to 24B
View on Reddit #79060104

MerePotato@reddit

Yes
View on Reddit #79059646

DayshareLP@reddit

20b would be nice
View on Reddit #79059530

MerePotato@reddit

35B any day, 24GB VRAM is the consumer hardware sweet spot
View on Reddit #79059366

stoppableDissolution@reddit

None, tbh. Qwen models have been raw disappointment since 2.5 and qwq.
View on Reddit #79035669

cdshift@reddit

For what use cases? Qwen3 Coder Next is a daily driver for me on my local setup with OpenCode
View on Reddit #79037259

stoppableDissolution@reddit

I gave up on local coding long ago, so idk about that. But for things like classification/ranking/synthetic data generation/etc qwen is kinda sad compared to heavily quantized mistral large or glm air or gemma. Other case is rp, but it never even entered competition there.
View on Reddit #79037464

Tai9ch@reddit

Local coding works fine now, starting at about 48GB of VRAM.
View on Reddit #79047870

stoppableDissolution@reddit

It's sloooooow and significantly worse than claude. Like, there's nothing private about my code that is being pushed to public github anyway, so why use a worse product?
View on Reddit #79052899

Tai9ch@reddit

Being slow is an issue with your hardware specs. How well it performs is an issue with model choice and workflow. Sure, you're not going to get the same one-shot performance from an 80B model like Qwen3-Coder-Next that you'd get from a 400B+ model like Claude Opus. But there are certainly several open models in the same broad capability class as the frontier proprietary models, and there's a pretty smooth gradient from even 30B models up to 1T models with all of them being useful for coding.
View on Reddit #79057426

stoppableDissolution@reddit

Well, sure, big glm is good enough to replace closed models, but nothing that can run in 48gb is remotely useful in my opinion. It takes more effort to make 80b qwen or, idk, 32b glm produce good-enough code than to write it myself.
View on Reddit #79057902

Tai9ch@reddit

I've had good results from 80B Qwen, and that'll run in 48GB at Q4 even with a reasonable amount of context, and it'll be *fast* on something like a pair of 3090s. Of course bigger is better. And bigger is entirely achievable with a little bit of effort (and a moderate to gargantuan amount of money).
View on Reddit #79059089

GraybeardTheIrate@reddit

Definitely interested in a 35B, especially if it's dense.
View on Reddit #79058641

10minOfNamingMyAcc@reddit

Personally, 35B but 9B doesn't sound too bad either.
View on Reddit #79055093

PANIC_EXCEPTION@reddit

9B because it would be amazing to see it work on my phone. My laptop can already run Qwen-Coder-Next 80B and it works really well for general purpose as well.
View on Reddit #79054972

Cool-Chemical-5629@reddit

https://i.redd.it/u5jr2o6bi3lg1.gif
View on Reddit #79054766

SuchAGoodGirlsDaddy@reddit

Honestly a SOTA 9B would be big for me right now. Of course I’ll happily wait for TheDrummer to get ahold of it.
View on Reddit #79049843

jacek2023@reddit (OP)

Are there any qwen finetunes from u/TheLocalDrummer?
View on Reddit #79050843

Lesser-than@reddit

9b just because I know it will fit on anything I own, I get excited for just about anything qwen though, as they continue to set a solid groundwork for the future of llms.
View on Reddit #79050781

Opening-Ad6258@reddit

9b because I can actually run it
View on Reddit #79050316

DenZNK@reddit

Share how you use it pls. I can't understand why I would need it, since I use cloud services. I have an RTX 5080. What tasks could it be used for besides STT or TTS?
View on Reddit #79047773

jacek2023@reddit (OP)

This sub is about using LLMs locally, not in the cloud
View on Reddit #79047843

DenZNK@reddit

That's why I'm asking what it will be used for, in case I need it, since my video card is currently only used for gaming :)
View on Reddit #79049086

jacek2023@reddit (OP)

that's a really big topic, but basically if you like spending time on gaming and don't want to learn new things then probably local LLM won't be interesting for you
View on Reddit #79049230

DenZNK@reddit

I haven't been playing much lately—I haven't turned anything on for over a month and spend most of my time vibing coding :)
View on Reddit #79049729

jacek2023@reddit (OP)

then explore this sub
View on Reddit #79049962

deathentry@reddit

Will 9B work with 8GB VRAM? I can only have 35k context window which means I can't even work angular mcp 🤣 😅
View on Reddit #79039447

sciencewarrior@reddit

It should be about 5GB with a 4-bit quantization, leaving a couple GB for a decent context size.
View on Reddit #79047933
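
To put a number on "a couple GB for context", here is a minimal sketch of the usual KV cache estimate for a plain full-attention transformer with grouped-query attention. All the architecture values below are hypothetical placeholders, not the real Qwen 3.5 9B config, and models with hybrid or linear attention (as discussed elsewhere in this thread) should need less than this.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_element: int) -> int:
    """Bytes of KV cache stored per token: one K and one V vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element


def max_context_tokens(budget_gb: float, bytes_per_token: int) -> int:
    """How many tokens of KV cache fit in the given budget (1 GB = 1e9 bytes)."""
    return int(budget_gb * 1e9 // bytes_per_token)


if __name__ == "__main__":
    # Hypothetical config: 36 layers, 8 KV heads, head_dim 128, fp16 cache (2 bytes)
    per_token = kv_bytes_per_token(n_layers=36, n_kv_heads=8, head_dim=128,
                                   bytes_per_element=2)
    print(per_token, "bytes per token")
    print(max_context_tokens(2.0, per_token), "tokens fit in 2 GB")
```

Quantizing the KV cache to 8 bits would halve `bytes_per_element` and roughly double the context that fits, at some cost in quality.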

JumpyAbies@reddit

But he just launched... Soon someone else will be crying about the 7b, 3b, 0.6b, 1m, 1k
View on Reddit #79046357

Septerium@reddit

Both would be cute toys to play with
View on Reddit #79045738

-InformalBanana-@reddit

35B only if it is a MOE, otherwise 9B. But for me an 80 or 90BA3B would be a good MOE, cause I have 96GB ram. Or maybe they should try an A4B MOE cause Qwen 4B has good performance for its size so maybe that would translate well into MOE, hopefully that won't slow the model down too much.
View on Reddit #79045469

Look_0ver_There@reddit

I'll take a 120B one thanks!
View on Reddit #79043227

Zestyclose-Shift710@reddit

35b if moe
View on Reddit #79043075

YoussofAl@reddit

4B. You’re all sleeping on 4B 2507. My favourite model.
View on Reddit #79042515

Ardalok@reddit

Personally, I'm looking forward to Gemma 4 more.
View on Reddit #79040920

jeekp@reddit

nemotron 3 super nvfp4 on llama.cpp
View on Reddit #79040829

LushHappyPie@reddit

7B to 12B with Test Time Training. I couldn't care less about 5% stronger reasoning or 7% stronger agentic performance in a local model.
View on Reddit #79040549

pigeon57434@reddit

35B definitely
View on Reddit #79039968

johnmacleod99@reddit

9B
View on Reddit #79039514

NullKalahar@reddit

9b
View on Reddit #79038353

cruzanstx@reddit

35b
View on Reddit #79037613

DrNavigat@reddit

As long as they aren't thinking models that waste my hardware with tokens that barely alter the final answer and clutter my context...
View on Reddit #79037450

ab2377@reddit

there should be a menu, like in restaurants, "what parameter count would you like to have?", you click 9, "your order will be served in 5 minutes", you click download after 5 minutes.
View on Reddit #79037236

Conscious_Nobody9571@reddit

9B pls
View on Reddit #79037168

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! [Come check it out!](https://discord.gg/PgFhZ8cnWW) You've also been given a special flair for your contribution. We appreciate your post! *I am a bot and this action was performed automatically.*
View on Reddit #79036149

Black-Mack@reddit

Qwen 3.5 1.5B
View on Reddit #79035570

bene_42069@reddit

9B prolly the only one I can run with my laptop lmao
View on Reddit #79034658

FantasticProcedure46@reddit

Qwen3.5-VL-9B
View on Reddit #79032703

ilintar@reddit

3.5 is VL by default.
View on Reddit #79034559

Alby407@reddit

None of them.
View on Reddit #79034451

Single_Ring4886@reddit

I will look like some kind of Qwen fanboy, but I must say that as open source models go theirs are the best. It feels like their models are well balanced, not obsessed with just coding like glm or kimi etc. Maybe the new DS will be good, but then again it will have 700B
View on Reddit #79034377

Confident-Aerie-6222@reddit

A good 4B multilingual model that beats gemma models at translation abilities and is also good at logic, thinking and coding.
View on Reddit #79034196

dampflokfreund@reddit

35B A3B. Probably a lot better than 9B and still fast enough.
View on Reddit #79034028

Slow_Concentrate3831@reddit

Between 14b and 20b would be cool
View on Reddit #79033960

__JockY__@reddit

235B!
View on Reddit #79033170

somkomomko@reddit

I have a 36gb MacBook. Sadly it doesn't fit 32b for anything useful and inference is so slow
View on Reddit #79032854

jacek2023@reddit (OP)

you should compare to 30B A3B
View on Reddit #79032908

Zyj@reddit

The 397B works ok
View on Reddit #79032774

mehhxx@reddit

I am keeping my hopes up for an extensive list of options just like Qwen 3 was, as even a 0.6b reasoning model would come in incredibly handy for very low-end devices and edge cases.
View on Reddit #79032305

rawednylme@reddit

I want both, equally.
View on Reddit #79032260

LivingHighAndWise@reddit

35B
View on Reddit #79032207

MDSExpro@reddit

Minimax-M2.5 REAP AWQ so 128GB of VRAM is enough to get that running with full context.
View on Reddit #79031806

Malfun_Eddie@reddit

Split in the middle 14b - 16b is perfect for 16GB VRAM.
View on Reddit #79031769

sunshinecheung@reddit

9b
View on Reddit #79031293

Glxblt76@reddit

The small one I can run on laptop.
View on Reddit #79031253

Paramecium_caudatum_@reddit

Qwhen gguf /s
View on Reddit #79029610

jacek2023@reddit (OP)

Llama.cpp support for Qwen 3.5 was merged many days ago
View on Reddit #79030611

Paramecium_caudatum_@reddit

Thank you for your reply!
View on Reddit #79031054

ExcitementSubject361@reddit

Would love to see a new QwQ 32b... but I don't think we'll get it
View on Reddit #79029886

sleepingsysadmin@reddit

im hoping 35b thinking is released and it scores \~25% or so on term bench hard.
View on Reddit #79029728