Have you tried Qwen3-Next-80B-A3B or Qwen3-Coder-Next-80B-A3B with linear attention? The whole model fits in 24GB VRAM, and runs at ~50 t/s on my PC, so even if part of it spills over into RAM, it will probably still be fast enough to be very usable.
I am new to local LLMs. Started playing with them today. Haven't figured out fine-tuning and the intricacies yet (thinking vs non-thinking, A3B meaning 3B active despite 35B total, and so on).
My current environment is llama.cpp + Open WebUI. AMD 7700 XT (12GB VRAM) + 32GB DDR5 RAM.
What would you say are the best models I should try and experiment with?
Qwen3.5 9B, Qwen3.5 35B A3B, GLM, gpt-oss, etc.?
Sorry for asking for info like this. I just saw that we have similar hardware. Thank you.
Thanks. I'm going to try an mxfp4 quant and see if it works better than the q4_k_s from unsloth.
But it's still a 4-bit quantization so it will require 40GB just for the weights alone... definitely won't fit in 24GB vram. I'm impressed you're getting 50 t/s on your pc. What is your hardware?
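For anyone checking the arithmetic, the 40GB figure is just parameter count times bits per weight. A rough sketch, assuming a flat 4 bits per weight (real GGUF quants such as q4_k_s average somewhat more than that, and this ignores KV cache and runtime overhead). `weight_gb` is a made-up helper name, not part of any library:

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Parameter counts mentioned in this thread; 4 bits/weight is an assumption.
    for name, params in [("80B", 80), ("35B", 35), ("9B", 9)]:
        print(f"{name} at 4-bit: ~{weight_gb(params, 4):.1f} GB for weights")
```

Anything above your VRAM size has to spill into system RAM, which is where CPU offload comes in.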
I have no idea what hardware he has, but launch Claude Code, Codex, or Qwen with a free plan in your llama.cpp directory, tell it to read this discussion (just give it the URL) and the unsloth page for the model (like [https://unsloth.ai/docs/models/qwen3-coder-next](https://unsloth.ai/docs/models/qwen3-coder-next) ), allow it to search the web, and have it discuss how best to configure and compile your build of llama.cpp, then run ./build/llama-bench on your particular hardware. Just use LLMs via API to bootstrap your local LLM.
Use MoE models around 70B parameters and offload weights to cpu. I'm assuming you are using Qwen3.5 27B, which is dense. Dense will start slow and slow down faster.
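A back-of-the-envelope way to see why MoE helps here: token generation is usually limited by memory bandwidth, because every generated token has to read the active weights once. The sketch below is only a rough upper bound under that assumption. The 60 GB/s bandwidth value is a placeholder for dual-channel DDR5, not a measurement, and `max_tokens_per_s` is a hypothetical helper. It ignores KV cache reads, compute, and any GPU/CPU split.

```python
def max_tokens_per_s(active_params_billion: float,
                     bits_per_weight: float,
                     bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each token reads all active weights once."""
    active_bytes = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / active_bytes


if __name__ == "__main__":
    bandwidth = 60.0  # GB/s, assumed placeholder for system RAM
    # A3B MoE: only ~3B parameters are read per token.
    print(f"MoE, 3B active, 4-bit: <= {max_tokens_per_s(3, 4, bandwidth):.1f} t/s")
    # Dense 27B: all 27B parameters are read per token.
    print(f"Dense 27B, 4-bit:      <= {max_tokens_per_s(27, 4, bandwidth):.1f} t/s")
```

Real numbers will be lower, but the ratio between the two cases is the point.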
If you mean the Qwen Next instruct from last year, I think this one is already an update for it. I still hope they give us another one, especially something around the 60B size, because I really want people with 16GB of RAM and 16GB of VRAM to experience how good Qwen can be even at low quants, though I'm certain the 35B would be great too...
60B is also a good size, no doubt about it. But I think they might release 2 dense models first: a 9B and a 32B. I think with the hybrid attention, a Q4 of the 32B can fit in 24GB VRAM with 127K context size.
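To sanity-check a claim like this, the standard KV cache estimate is 2 (keys and values) × attention layers × KV heads × head dim × context length × bytes per value. The sketch below uses entirely made-up architecture numbers (64 layers, 8 KV heads, head dim 128, one in four layers using full attention), since no specs for such a model are given here. It also ignores the small fixed-size state that linear-attention layers keep.

```python
def kv_cache_gb(attn_layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: float = 2.0) -> float:
    """Approximate KV cache size in GB for layers that use full attention."""
    # Factor of 2 covers both the key and the value tensors.
    total_bytes = (2 * attn_layers * kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9


if __name__ == "__main__":
    ctx = 127_000
    # Hypothetical 64-layer model, fp16 cache (2 bytes per value).
    all_full = kv_cache_gb(64, 8, 128, ctx)
    # Hybrid: assume only 16 of the 64 layers keep a growing KV cache.
    hybrid = kv_cache_gb(16, 8, 128, ctx)
    print(f"full attention everywhere: ~{all_full:.1f} GB")
    print(f"hybrid (1 in 4 layers):    ~{hybrid:.1f} GB")
```

Quantizing the cache to q8 would roughly halve `bytes_per_value`; whether the total fits next to the weights depends on the real architecture.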
yeah that's what i was thinking, though 5b sounds very big for a vision encoder lol
I think this is the first natively multimodal model (and probably one of the first big name local models) so i'm hoping the vision is crazy good!
Tbh, qwen 3vl already has basically perfect vision (considering the size) imo, so i'm super excited for 3.5 since i use these models for engineering lol
Ppl with 16Gb of ram are very unlikely to have 16Gb of VRAM.. also u wouldn’t be able to fit any context.
More likely 16Gb RAM with 6-8Gb VRAM, maybe 12. For those 8b dense or 30b MOE is best.
I think 32Gb RAM and 16Gb VRAM is much more common, and where 80b is best.
Same. I'm literally checking like 4 times a day. Finally being able to do coding and vision with one model is gonna unlock one of my gpus for other tasks.
Qwen3.5-coder-80b would make me moist.
Yeah. 😞 They are on Hugging Face now too. I guess I'll try to make do with 122b at q3. 😬 I really don't like using less than q4_k_m and I was hoping to do q6. Maybe they will release an 80b down the line similar to Next.
yeah, that'd be super nice, but i doubt it..
dang that sucks, 80b would have been IDEAL for me. but those extra 42b parameters are just too much.
Gonna be hard to replace the qwen 3 next models for now ig.
Honestly, a 1.7B one. It is crazy how good the Qwen3 version of it is. Can be run on CPU if you need cheap and can be run on GPU if you need fast. Can correctly summarize, extract and classify multilingual texts (EU languages) while correctly following instructions. I found it to be the perfect ratio of size to quality.
9B all day. The 35B models are impressive but the hardware requirements put them out of reach for most people running local. A genuinely good 9B that fits in 8GB VRAM would change more workflows than another 35B that needs a 3090.
Yeah that is exactly what op said, it puts them out of reach for most people running local. Most people use 16 GB RAM these days, even then with Windows, background apps and kv cache, you get no more than 4-6 gigabytes for running models.
I've got 16 gigs of VRAM and 16 gigs of RAM so I consider myself above average for total amount of fast memory, and can't run anything more than 14B at 32-48k ctx@q4_k_m at any usable speeds and comfortable memory usage. Most people overestimate the "average guy" these days.
35B sparse is also still usable with 16 GB of RAM and 8 GB of VRAM, which wouldn't even be unusual for a system built 8 years ago with an RX 580. I don't think it's out of the reach of most people interested in local LLMs.
Not really, I may be doing something wrong but even GLM4.7 Flash at q3 is a stretch with all the windows stuff in the background. I'm dual booting Arch as well and the experience is more or less the same. It is either unbearably slow, like sub 10 tok/s or outright doesn't load at all.
It is possible to load and somehow use, but at that point you are not doing anything other than inference and it is not a realistic scenario. I think realistically at least 48 gigs of total memory is a must for actually incorporating local LLMs into any workflow.
Feels bad when I have more VRAM than people have RAM, and then to have more RAM than people have RAM+VRAM combined.
All that to power Qwen3-Next-80B_q6_k_xl at reasonable speeds.
I'd consider anyone who can run Qwen3-30B-A3B-Instruct-2507_q4_0 to be average.
well, if you're looking to purchase a new laptop or PC with this in mind, it's really not that big of a deal to go from 16GB to 32GB.
I'd argue 16GB is not enough for windows anyway, i have 32 and windows uses 20-24GB in the background.
Probably gonna switch to linux tbh, i only have windows cuz i need it for CAD and engineering software
Do you have a gaming machine that you're running some AI on, or have you intentionally built towards running AI models?
Because yes, reasonable gaming setups tend to max out at a single 16GB GPU, which makes a 30-35B model kind of crap.
But as soon as you're buying hardware to run AI models, options like Strix Halo or 2x 3090's start to be entirely feasible, and at that point 35B (or even 80B) becomes entirely feasible.
On my 16GB RX 9060 XT, Qwen3Coder 30B A3B _just barely_ fits at IQ_3_K_XL with 30K context, and I can get a decently useable inference speed of 115T/s. If I throw my 8GB RX 480 in, I can go up to about 64K context, but inference speed drops to about 55T/s, and prompt processing speed gets absolutely murdered.
I keep qwen3-vl:4b iq4 in vram for Frigate image analyzing, home assistant voice assistant, karakeep, open-notebook, and general questions and it works great. For more complicated tasks like Sure Finance to analyze my finances, I'll temporarily load in qwen3-vl:8b-instruct-q4_K_M. Looking forward to 3.5:9b to compare
Connecting an LLM to Frigate will generate AI descriptions for events. Also, in the new 0.17 versions, it'll generate AI summaries for notifications. It elevates Frigate to the next level and is highly recommended.
No idea why you're getting down votes. 5090 is not out of stock, it's just absurdly priced but you gotta pay to play and everyone wants to play this AI game.
i'm waiting on this one as well, qwen 3 4b is good enough for web search on open-webui, I just need to set up playwright as fetch\_url can't open some websites
Honestly both. The big 3.5 has vision, hopefully the small 3.5's also have it. Also looking forward to whatever Gemma is cooking, I really liked the 12B.
Would love this. Playing with Qwen3.3 397B and it's surprisingly fast. However, A15B is a bit high for the sparse MoEs they've been making. Maybe an A12B or even A8B/A9B.
As someone running local LLMs daily, I'd take a well-optimized 9B over a demanding 35B any day. Accessibility matters more than raw power for most users!
I wonder if maybe Qwen3.5 35b accidentally got eaten by hippos.
Maybe if it still doesn't get released in the next day or two, meaning we can be pretty sure that is what happened, we can all hold a candlelight vigil in remembrance of what a nice, wonderful local AI model it could have been, if it hadn't met such a tragic and untimely demise.
Maybe people can come up with some poems or song lyrics that we can quietly chant when we hold our candlelight vigils in memory of Qwen3.5 35b.
If it turns out that its slightly mentally challenged brother, Qwen3.5 9b also got eaten, then we can hold vigils for that as well, although that would be so tragic that we should not speak of such possibilities for now. Most likely it is just playing on the rainbow farm where your pet dog went on a super long vacation and you never saw it again when it got old. So, once it finds its way back from the rainbow farm, all will be well.
Love that amount of VRAM. Qwen3.5 works like a charm with unsloth iq3_xxs and context quantization set to q8. Even RoPE for 512k worked in koboldcpp. I'm running 2x RTX 6000 Pro.
100b-200b a10b multimodal with 1M context which is memory efficient. Waiting for Nemotron 3 Super 100b a10b, but hope that other teams will also go this way
Some more dense models between 30b and 120b would be awesome.
If they decide to skip the medium sized dense models this time around (which would be a huge shame, but wouldn't surprise me, given how things have been trending), then some not-so-sparse MoE like a 100b a10b or 70b a8b or something might be interesting (not sure if it would do what I think it could do, or if it would be a bad idea, but, I dunno, maybe it would be awesome, lol)
Dense models are not power efficient, and long context costs a lot. Everything that is important for larger-scale deployments is hard to get with dense models.
Yep, a dense model in the 12B-to-14B range would be great for folks with 16GB VRAM, and a dense model in the 24B-to-32B range would be great for 32GB VRAM.
>which would be a huge shame, but wouldn't surprise me, given how things have been trending
Yeah, I think even more than wanting to actually use a new mid-sized dense model from Qwen I'd like to see it simply as a suggestion that the industry as a whole hasn't dropped them for MoEs.
35B, and it's not even close. 9B is cute for quick local tinkering, but it hits that "sounds confident, misses the point" wall the second you want real reasoning or tool use. A good 35B quant on a 24GB card (or split across CPU/GPU with llama.cpp) is where it starts feeling like an actual assistant instead of autocomplete. The people hyping 9B are mostly just flexing that it runs on a laptop.
I am waiting for a flexi model that automatically adjusts from 8B to 80B and from A1B to A10B, and also switches between thinking and non-thinking depending on the task at hand, the available memory and the available hardware. I.e. given a simple task it behaves like an 8B A1B model, and given a difficult task it behaves like an 80B A10B with thinking. In the latter case it will use itself in 8B A1B mode for speculative decoding.
35b for sure. I wish they would create one with a few more active parameters. Something like 70b with A5b, as I think the active part affects intelligence more than the total parameter count does, while the total affects knowledge more (not a clear black and white for sure but a general observation)
>not a clear black and white for sure but a general observation
So far the only mid-size MoE that doesn't have that idiot savant feel to me is Air with 106b 12a.
I gave up on local coding long ago, so idk about that.
But for things like classification/ranking/synthetic data generation/etc qwen is kinda sad compared to heavily quantized mistral large or glm air or gemma. Other case is rp, but it never even entered competition there.
It's sloooooow and significantly worse than claude. Like, there's nothing private about my code that is being pushed to public github anyway, so why use a worse product?
Being slow is an issue with your hardware specs. How well it performs is an issue with model choice and workflow.
Sure, you're not going to get the same one-shot performance from an 80B model like Qwen3-Coder-Next that you'd get from a 400B+ model like Claude Opus.
But there are certainly several open models in the same broad capability class as the frontier proprietary models, and there's a pretty smooth gradient from even 30B models up to 1T models with all of them being useful for coding.
Well, sure, big glm is good enough to replace closed models, but nothing that can run in 48gb is remotely useful in my opinion. It takes more effort to make 80b qwen or, idk, 32b glm produce good-enough code than to write it myself.
I've had good results from 80B Qwen, and that'll run in 48GB at Q4 even with a reasonable amount of context, and it'll be *fast* on something like a pair of 3090s.
Of course bigger is better. And bigger is entirely achievable with a little bit of effort (and a moderate to gargantuan amount of money).
9B because it would be amazing to see it work on my phone. My laptop can already run Qwen-Coder-Next 80B and it works really well for general purpose as well.
9b just because I know it will fit on anything I own, I get excited for just about anything qwen though, as they continue to set a solid groundwork for the future of llms.
Share how you use it pls. I can't understand why I would need it, since I use cloud services. I have an RTX 5080. What tasks could it be used for besides STT or TTS?
that's a really big topic, but basically if you like spending time on gaming and don't want to learn new things then probably local LLM won't be interesting for you
35B only if it is a MOE, otherwise 9B.
But for me an 80B or 90B A3B would be a good MoE, because I have 96GB of RAM.
Or maybe they should try an A4B MoE, because Qwen 4B has good performance for its size, so maybe that would translate well into an MoE. Hopefully that won't slow the model down too much.
there should be a menu, like in restaurants: "what parameter count would you like to have?", you click 9, "your order will be served in 5 minutes", you click download after 5 minutes.
I will look like some kind of Qwen fanboy, but I must say that as open-source models go, theirs are the best. It feels like their models are well balanced, not obsessed with just coding like glm or kimi etc. Maybe the new DS will be good, but then again it will have 700B
I am keeping my hopes up for an extensive list of options just like Qwen 3 was, as even a 0.6b reasoning model would come in incredibly handy for very low-end devices and edge cases.
220 Comments
Kat-@reddit
social_tech_10@reddit
cristoper@reddit
Xantrk@reddit
huseynli@reddit
social_tech_10@reddit
cristoper@reddit
agoldin@reddit
techmago@reddit
Firm_Meeting6350@reddit
Far-Low-4705@reddit
Kat-@reddit
-InformalBanana-@reddit
Significant_Fig_7581@reddit
Iory1998@reddit
Fresh_Finance9065@reddit
Beautiful_Egg6188@reddit
Fresh_Finance9065@reddit
Significant_Fig_7581@reddit
Iory1998@reddit
Significant_Fig_7581@reddit
Iory1998@reddit
Far-Low-4705@reddit
Iory1998@reddit
Far-Low-4705@reddit
Iory1998@reddit
AlwaysLateToThaParty@reddit
Far-Low-4705@reddit
finah1995@reddit
Significant_Fig_7581@reddit
Far-Low-4705@reddit
hesperaux@reddit
Far-Low-4705@reddit
hesperaux@reddit
Far-Low-4705@reddit
Significant_Fig_7581@reddit
Far-Low-4705@reddit
journalofassociation@reddit
shortfinal@reddit
Far-Low-4705@reddit
IMightBeAlpharius@reddit
jinnyjuice@reddit
insmek@reddit
Funny_Working_7490@reddit
Daniel_H212@reddit
MoffKalast@reddit
zipzag@reddit
megacewl@reddit
CriticismNo3570@reddit
Open-Raise-6676@reddit
vhthc@reddit
ttkciar@reddit
vhthc@reddit
ttkciar@reddit
Traditional-Card6096@reddit
Material-Ad5426@reddit
Tai9ch@reddit
hesperaux@reddit
MinusKarma01@reddit
peregrinefalco9@reddit
Daniel_H212@reddit
hakanavgin@reddit
Daniel_H212@reddit
hakanavgin@reddit
Anduin1357@reddit
TheRealMasonMac@reddit
Far-Low-4705@reddit
Tai9ch@reddit
Thunderstarer@reddit
EnthropicBeing@reddit
andy2na@reddit
EnthropicBeing@reddit
andy2na@reddit
nakedspirax@reddit
IrisColt@reddit
datfalloutboi@reddit
jonydevidson@reddit
peregrinefalco9@reddit
nakedspirax@reddit
AppealThink1733@reddit
JumpyAbies@reddit
swagonflyyyy@reddit
KaosNutz@reddit
AppealThink1733@reddit
KaosNutz@reddit
hum_ma@reddit
TeamCaspy@reddit
Amazing_Athlete_2265@reddit
dances_with_gnomes@reddit
sloth_cowboy@reddit
dances_with_gnomes@reddit
Daniel_H212@reddit
Neither-Phone-7264@reddit
AlwaysLateToThaParty@reddit
Neither-Phone-7264@reddit
Significant_Fig_7581@reddit
Initial-Argument2523@reddit
Straight_Abrocoma321@reddit
NoobMLDude@reddit
jacek2023@reddit (OP)
NoobMLDude@reddit
NoobMLDude@reddit
ribsdug@reddit
InvDeath@reddit
ManufacturerWeird161@reddit
Hanselltc@reddit
TokenRingAI@reddit
EbbNorth7735@reddit
ttkciar@reddit
TokenRingAI@reddit
jacek2023@reddit (OP)
kasinjsh@reddit
Conscious_School6035@reddit
Turkino@reddit
SpicyWangz@reddit
charles25565@reddit
silenceimpaired@reddit
zipzag@reddit
jacek2023@reddit (OP)
Ylsid@reddit
CarlCarlton@reddit
swagonflyyyy@reddit
DeepOrangeSky@reddit
kwinz@reddit
RayHell666@reddit
ciprianveg@reddit
Its_Powerful_Bonus@reddit
Its_Powerful_Bonus@reddit
DeepOrangeSky@reddit
Its_Powerful_Bonus@reddit
ttkciar@reddit
toothpastespiders@reddit
singhapura@reddit
ttkciar@reddit
beedunc@reddit
selnatic@reddit
ALittleBitEver@reddit
alexp702@reddit
Far-Low-4705@reddit
venkada_321@reddit
wanderer_4004@reddit
m_mukhtar@reddit
toothpastespiders@reddit
xandep@reddit
t_krett@reddit
vovxbroblox@reddit
LegacyRemaster@reddit
joblesspirate@reddit
Darth_Ender_Ro@reddit
Adventurous-Paper566@reddit
Initial-Argument2523@reddit
MerePotato@reddit
DayshareLP@reddit
MerePotato@reddit
stoppableDissolution@reddit
cdshift@reddit
stoppableDissolution@reddit
Tai9ch@reddit
stoppableDissolution@reddit
Tai9ch@reddit
stoppableDissolution@reddit
Tai9ch@reddit
GraybeardTheIrate@reddit
10minOfNamingMyAcc@reddit
PANIC_EXCEPTION@reddit
Cool-Chemical-5629@reddit
SuchAGoodGirlsDaddy@reddit
jacek2023@reddit (OP)
Lesser-than@reddit
Opening-Ad6258@reddit
DenZNK@reddit
jacek2023@reddit (OP)
DenZNK@reddit
jacek2023@reddit (OP)
DenZNK@reddit
jacek2023@reddit (OP)
deathentry@reddit
sciencewarrior@reddit
JumpyAbies@reddit
Septerium@reddit
-InformalBanana-@reddit
Look_0ver_There@reddit
Zestyclose-Shift710@reddit
YoussofAl@reddit
Ardalok@reddit
jeekp@reddit
LushHappyPie@reddit
pigeon57434@reddit
johnmacleod99@reddit
NullKalahar@reddit
cruzanstx@reddit
DrNavigat@reddit
ab2377@reddit
Conscious_Nobody9571@reddit
WithoutReason1729@reddit
Black-Mack@reddit
bene_42069@reddit
FantasticProcedure46@reddit
ilintar@reddit
Alby407@reddit
Single_Ring4886@reddit
Confident-Aerie-6222@reddit
dampflokfreund@reddit
Slow_Concentrate3831@reddit
__JockY__@reddit
somkomomko@reddit
jacek2023@reddit (OP)
Zyj@reddit
mehhxx@reddit
rawednylme@reddit
LivingHighAndWise@reddit
MDSExpro@reddit
Malfun_Eddie@reddit
sunshinecheung@reddit
Glxblt76@reddit
Paramecium_caudatum_@reddit
jacek2023@reddit (OP)
Paramecium_caudatum_@reddit
ExcitementSubject361@reddit
sleepingsysadmin@reddit