FOR ME, Qwen3.5-27B is better than Gemini 3.1 Pro and GPT-5.3 Codex

[-]

Specialist_Golf8133@reddit

qwen honestly punches way above its weight for the size. the fact that a 27b model can hang with frontier stuff for specific tasks is kinda the whole point of why local matters. what are you using it for mostly, curious if you're seeing the same gap i am on reasoning vs pure retrieval tasks

[-]

EffectiveCeilingFan@reddit (OP)

I use Copilot on the required school assignments, which are pretty simple stuff like todo list apps, snake games, basic REPL, etc. As for my personal use, Qwen3.5-27B has been handling software packaging for Fedora, driver development in C, webapp development in TS (SvelteKit), and dataset curation & model fine-tuning in Python.

[-]

ntrp@reddit

You use copilot for assignments? What's the point even going to school then

[-]

EffectiveCeilingFan@reddit (OP)

I'm forced to use GitHub Copilot in university

I'm required to use it by my professor. I have to submit a screenshot of my premium requests usage to Piazza every week to prove that I'm using Copilot.

[-]

ntrp@reddit

This is crazy, basically your professor forces you to vibe code and not learn anything? What is he grading, you prompt engineering skills or simply if you had enough credits for opus this month?

[-]

bigrealaccount@reddit

Every industry is using AI, and every software engineer has a model like Claude/Codex provided for them. The professor clearly knows exactly what they're doing, and they most likely also have students do writeups/questions about the code to ensure they understand it and don't just copy paste.

Honestly, if you don't see how this is a good thing for the students, you're the vibe coder here.

sincerely, a software engineer

[-]

ntrp@reddit

I'll pick a senior engineer using ai tools any day than a vibe coder and by not teaching the students how to code he is building vibe coders

[-]

bigrealaccount@reddit

All he said is that his professor is making them use AI tools, not that he wants them to ship 100% of code with it without understanding. Every software engineer is using these tools and the whole point of education is to prepare you for the workforce.

The only person who mentioned vibe coding is you. Nobody is saying vibe coding isn't bad.

Learn to read and stop getting so emotional.

[-]

Becbienzen@reddit

Je pense qu'il est essentiel d'apprendre aux étudiants de savoir utiliser l'IA.
L'apprentissage du développement est tout autre. Et la pédagogie doit s'adapter pour s'assurer que ces étudiants apprennent bien ces deux facettes et les retiennent.

l'IA semble avoir de beaux jours devant elle. Apprendre ses usages est important afin que le marché du travail soit de nouveau ouvert à l'embauche de junior...
Je ne vois rien d'autre qui pourrait sauver l'emploi des juniors. Tous domaines confondus.

Sans compter que les compétences en automatisations des process vont être très rapidement recherché.

[-]

Mediocre_Paramedic22@reddit

I use qwen3.5 122b locally and find it to be very useful and capable.

[-]

hawseepoo@reddit

Same here. 27B is too slow on my HX 370 so I have to use 122B A10B, wish I had something a bit faster

[-]

Mediocre_Paramedic22@reddit

I have 122b running on 395 strix halo. Conversational responses are usually under 8 seconds complete. Tool calls and script works obviously takes a bit longer than a frontier model, but it’s usable and capable and keeps data private. I’m pretty happy with it

[-]

Potential-Leg-639@reddit

Yeah the software stack for strix halo became better and better. Try again with latest llama-cpp, it gave me a boost again!

[-]

Mediocre_Paramedic22@reddit

I’ve been having issues with the newer llama.cpp giving me parsing errors when I have thinking on. Have you encountered that any?

[-]

Potential-Leg-639@reddit

Not really, i recompiled it with vulkan some days ago

[-]

deenspaces@reddit

how much tokens/sec and what context? out of curiosity to compare it to 3090

[-]

Mediocre_Paramedic22@reddit

My strix handles the 122b, about 300 tok/second prompt processing and generates 18-22 tok/ second pretty consistently. I’ll try the 27b later if you want to know how it handles that on mine.

[-]

nesymmanqkwemanqk@reddit

Usually around 20-25tps tg. It goes down to 20s on larger context.

[-]

deenspaces@reddit

you mean with 3090? i get like 35 tok/sec with qwen3.5-27b q4_k_m single 3090 and around 20-25 with qwen3.5-27b q8 on a dual 3090.

[-]

nesymmanqkwemanqk@reddit

No sorry, this is from strix halo of the 122B

[-]

hawseepoo@reddit

All of these are Q4_K_M with a 60,000 context window:

Qwen3.5 122B A10B: 7.5 tokens/sec
Qwen3.5 27B: 2.5 tokens/sec
Qwen3.5 35B A3B: 18 tokens/sec
Nemotron 3 Super: almost 7 tokens/sec

Time to first token is trash tho when using any significant portion of that context.

[-]

deenspaces@reddit

thanks

[-]

nesymmanqkwemanqk@reddit

Having said that you can boost the prompt cache and get very comfortable speeds over long sessions

[-]

GroundbreakingRow969@reddit

Can I run that on two 5090’s, 128gb ram and a threadripper with enough context to give it prompts like “fix the issue my guy! This is taking forever.”

[-]

relmny@reddit

I'm curious to whether you compared it with 27b?

(I'm still trying to figure out which one I like best...)

[-]

EffectiveCeilingFan@reddit (OP)

For me, 27B wins because it's easier to run. Token generation is slower, but prompt processing is significantly faster since 27B fits in VRAM, and 122B cannot.

[-]

relmny@reddit

For me 27B q6kl (20.5t/s) is faster than 122B q5kl (13t/s).

But 122B gave once (other times it didn't) an answer that no other model gave, although 27B is usually good enough and I tend to use more.

[-]

PiaRedDragon@reddit

Same, it is my go to at the moment with my fav quant version of it.

It makes me sad that we can't use it at work because of stupid policies on Chinese models.

[-]

ForsookComparison@reddit

I'm forced to use GitHub Copilot in university

All these years and academia is keeping with the tradition of having no idea what goes on in the actual workplace

[-]

Kornelius20@reddit

Its insane because my university has an AI student integration group I'm a part of, and they just bought 2 DGX sparks for undergrads to learn how to use AI, and it's only me and another grad student who are the only people in the entire group that know anything about AI beyond the web interfaces and marketing hype.

[-]

NinjaOk2970@reddit

Teaching ai principles and uses is nice, but I can tell the plan is poorly executed...

[-]

LoaderD@reddit

The hard part about buying for a university is they usually have some dumbfuck vendor agreements. So you can get a computer with 2x5090s, 64gb of ram, but it’s going to cost what a spark does, so you might as well buy sparks and get a pat on the head for buying something ‘cutting edge’ that one of the csuite people have heard

[-]

Karyo_Ten@reddit

2x5090s, 64gb of ram, but it’s going to cost what a spark does

Does a spark cost $10k these days?

[-]

TheThoccnessMonster@reddit

Yeah I was going to same - that comment was extra strength wrong.

[-]

Karyo_Ten@reddit

That username ... a lurker of r/mechanicalkeyboards?

[-]

TheThoccnessMonster@reddit

:)

[-]

xrvz@reddit

Hey, both configs have 128GB RAM.

[-]

Kornelius20@reddit

Well considering me and the other grad student were asked to come up with a plan AFTER the hardware was bought, I don't think they even considered the execution part yet lol

[-]

roosterfareye@reddit

It's the same outside... Most people seem to believe it part of the occult and a conspiracy to take their jjjeerrbbbbsssss

[-]

unjustifiably_angry@reddit

ngl I'm a huge fan of AI, it's enabled me to do a lot of things I'd only dreamed of, but every time I use it I feel like I'm in an episode of Black Mirror. It's too good, too powerful. In Warhammer they call it Abominable Intelligence and I'm half-way there.

[-]

Altruistic_Heat_9531@reddit

See i kinda hate this, my friend university got access to DGX pod. THE DGX Pod 8xA100 compute node, and no one using that shit other than simple ML, not heavy workload, no CFD, no finetune, not even Lora training. Meanwhile i have to do slogging some bullshit and social networking to gather enough compute resource

[-]

Kornelius20@reddit

If there's one thing grad school has taught me, it's than modern universities are incredibly wasteful institutions.

But it's also weird because a part of me know that all the funding they have that allows them to be wasteful is also the same pool of funding that has allowed me to do fp8 training on my models :/

[-]

Solid-Roll6500@reddit

Do you have a blog or yt channel sharing what you're learning?

[-]

Kornelius20@reddit

Unfortunately not yet! I'm hoping to start one once my candidacy is done but that's still a few months away

[-]

lemondrops9@reddit

look on the bright side, you'll get a lot of time on them.

[-]

LoaderD@reddit

Open chassis, replace it with a few linked raspberry pis and take the spark internals home.

Same AI functionality for students /s

[-]

Kornelius20@reddit

It's honestly funny how many of my friends I've told this to who have said some variant of this

[-]

GifCo_2@reddit

Copilot has access to like 20models you can easily switch between instantly. This is perfect for learning.

[-]

ForsookComparison@reddit

Each one working at maybe 10% of its potential :(

But I guess it does teach you to be cognisant about things like price, token gen speeds, what to use when, etc

[-]

GifCo_2@reddit

Copilot was crap compared to other harnesses a month ago. Not so anymore.

[-]

megacewl@reddit

The students have no idea either. Almost all of them use the free tier of ChatGPT lol.

[-]

Sufficient-Rent6078@reddit

The free tier of ChatGPT is astonishingly bad compared to what is possible on a single 24GB card today.

[-]

IrisColt@reddit

Hmm...

[-]

ForsookComparison@reddit

Their guidance counselor probably still thinks tech is booming

[-]

stoppableDissolution@reddit

I have access to claude code and cursor in workplace and still use copilot (mostly with opus but sometimes with codex). Idk.

[-]

ForsookComparison@reddit

Why

[-]

stoppableDissolution@reddit

Why not? It works better for me idk

[-]

ForsookComparison@reddit

Fair - I can't refute your own experiences as much as they might surprise me

[-]

Both_Opportunity5327@reddit

Yet its the people that come out of Academia that built all this stuff...

It seems that you have no idea.

[-]

ForsookComparison@reddit

Olympic Gold for the level of gymnastics needed to take that message from my comment

[-]

Waarheid@reddit

My employer is also procuring GitHub CoPilot licenses fwiw. Everyone on our team just uses Claude Code regardless, though.

[-]

-dysangel-@reddit

Copilot is a nice cheap fallback to have when you run out of allowance on CC. I just use GLM Coding Plan for everything at the moment though. It's very frustrating how it bugs out at 80k context, but if I make sure to compress every so often, I can just work all day with multiple sessions and never hitting rate limits

[-]

Revolutionary_Loan13@reddit

I've got access to both and usually prefer Copilot with Claude models.

[-]

panic_in_the_galaxy@reddit

What's so bad about it?

[-]

tillybowman@reddit

seems more like you don't know. a lot of software companies use copilot licenses.

i mean copilot hast all anthropic models, all openai models, google models. it's quite good.

and ofc you can use it with coding harnesses like opencode etc. i mean even claude code if that's your style.

its just an llm provider and one that has all the features big companies need for integrating user accounts, accounting, easy payments and receipts.

[-]

eyaf1@reddit

You can also add it to intellij which makes things really easy. Idk what was the op's point.

[-]

Darkmoon_AU@reddit

Massive enterprise checking in; GitHub Copilot only.

[-]

BootyMcStuffins@reddit

We don’t use qwen in the workplace either

[-]

Heavy-Focus-1964@reddit

are yall running opencode for a harness?

[-]

EffectiveCeilingFan@reddit (OP)

I have to submit screenshots of how much "premium usage" I use every week on Piazza, it's so stupid. Can't believe I'm wasting all this fucking money to learn how to type in prompts.

Allegedly, GitHub completely rug-pulled the university when they removed frontier model access from Copilot for Students, in violation of whatever contract they had. Just a rumor, tho, absolutely no evidence other than some of the teachers being very obviously pissed on Piazza when thousands of students suddenly lost access one day. Just another reason I self-host.

[-]

raketenkater@reddit

For better performance tokens per second try

https://github.com/raketenkater/llm-server/tree/main

[-]

SupportMeNow@reddit

you are on amd so you need to use Q4_0 for faster speeds, amd have faster kernels for the simple quants

[-]

TheGlobinKing@reddit

I still can't make Qwen3.5 believe me when I say it's 2026, even if I put the current date in the system prompt. In the thinking process, either it says the date is "in the future" but it should not call me a liar but try to explain blah blah, or it says it's trying to adhere to the "hypothetical 2026 scenario"...

[-]

relmny@reddit

I use Open Webui and have set date variable in the system prompt and works fine. I don't need to tell anything to the model in the prompt and the model itself will look it up and use it as part of it's reasoning/response

[-]

TheGlobinKing@reddit

How? I'm using "The current date is {{CURRENT_DATE}}" but it refuses to believe me

[-]

relmny@reddit

I have this in Settings -> "System Prompt":

"Today is {{CURRENT_DATETIME}}"

[-]

unjustifiably_angry@reddit

Gemini still won't believe me when I say 128GB of RAM is >$1000 and tells me I'm being ripped off.

[-]

TheGlobinKing@reddit

LOL

[-]

GrungeWerX@reddit

Web mcp is your friend.

[-]

pandamander@reddit

Heretic uncensored versions of qwen3.5 accept system prompt dates and don’t waste thinking effort on it.

[-]

Jayfree138@reddit

Same for me. In my experience Qwen 3.5 9b and 27b hooked up to a good web search is better than Gemini and ChatGPT. Which is a huge surprise to me. I knew it would be good but not THIS good. If i could only figure out how to get my headless browser past reddit's bot check it would be legendary (just to read not to post).

[-]

AnotherBrock@reddit

What web search do you use? Ive tried tavily and brave API, but both were quite disappointing.

[-]

Jayfree138@reddit

I use the Perplexity web search API to pull roughly 60 sources per prompt. Then i run a local embedder, then local reranker at top k 15-20. So i typically get around 15 high quality chunks which get passed off to qwen 9b with a 32k context window. This all fits under 16 gbvram after system overhead.

The results are fantastic.

People shy away from Perplexity API but honestly paying pennies for good web crawler data curated for AI is worth it. Tavily and brave (and Google pse that i used to use) were not good.

[-]

Tamitami@reddit

Did you try SearXNG and compare it to perplexity? I like that it aggregates and is local, and it's free. It even has a json mode for exactly this

[-]

Jayfree138@reddit

i agree that SearXNG would be the optimal choice for total privacy. But Perplexity returns more AI friendly sites. I used to use Google PSE and i had to blacklist so many walled off sites i eventually just made a white list.

But i'm fine with Perplexity getting my search queries because that's all they see on their end. They cant see any of my context or private chat with my local ai beyond a simple search Query. So it's still mostly private and like half a penny for each search.

It suits me but to each their own. Everyone has their preferences.

[-]

dlcsharp@reddit

Why not use the Reddit JSON API?

[-]

pmarsh@reddit

Can you get to reddits rss at least?

[-]

Jayfree138@reddit

nah. I just checked and they block that too. They're pretty closed off these days. I'm sure there's a way somone's figured out but i just havent gotten around to researching and working on it.

[-]

HopePupal@reddit

you can't get to RSS and/or the read-only API methods? i know the problem. you're using a user agent header Reddit doesn't like, or you're not sending one at all. they're extremely silly about that since it's not a very useful signal, but i guess at Reddit scale every little bit helps. the default user agents from cURL, Python requests, the Roux Reddit API crate, etc. all get blocked. they won't get your IP blocked (or at least it hasn't ever happened to me) but Reddit will give you the You've been blocked by network security. message instead of what you requested.

set a different user agent and the problem goes away.

they have a recommended format…

https://www.reddit.com/r/redditdev/comments/1j2sgxw/please_ensure_your_useragents_are_unique_and/

…but you don't need to use it, almost anything works:

# fails
curl -L https://reddit.com/r/LocalLLaMA.rss

# succeeds
curl -H 'User-Agent: honkai-impact/3' -L https://reddit.com/r/LocalLLaMA.rss

[-]

Jayfree138@reddit

ah, thanks! That's probably it. That'd definitely be helpful.

[-]

Mysterious_Bother617@reddit

I was thinking to just use internet archive API or similar, that seems like it would work? Wouldn't be 100% up to date of course, but for getting data it seems like a good method

[-]

HopePupal@reddit

use PullPush or ArcticShift, they're the successors to the old PushShift Reddit scraper API

if you don't need super fresh data, do everyone a favor and use and seed the torrents

[-]

AvidCyclist250@reddit

you do that by not using a headless browser

[-]

Dany0@reddit

I unironically have 27B plan things and sonnet/opus 4.6 implement them. Finally, a little alien intelligence idiot that can almost do useful things. I adore it. Needs a lot of context but once you strap it in it just does it

[-]

putrasherni@reddit

What quant 27B are you using to plan ?

[-]

Dany0@reddit

Q8 on mac, q4 with small ctx on my 5090

[-]

CapeChill@reddit

I was just trying this last night, what context do you set? It seemed okay at 128k.

[-]

Ok-Ad-8976@reddit

Coincidentally I just finished testing that and you can get up to 196k of context even using vision for Qwen3.5.27b unsloth Q4_XL with 32GB VRAM

[-]

Dany0@reddit

Q4 destroys perplexity past about 80k ctx for me. Depends on the task though

[-]

Ok-Ad-8976@reddit

Good to know. I was just testing for shear number of tokens.

[-]

CapeChill@reddit

Sweet thanks! I’m going to compare that with qwen coder next 80b gguf on strix halo.

[-]

MrBIMC@reddit

Idk what you mean by small ctx, because I’m running q4 with mmproj and it fits 131k ctx on a single 3090.

I assume you’re doing something wrong.

[-]

Dany0@reddit

You're assuming wrong

[-]

quantum_splicer@reddit

May I ask, why do you think the 27B model accels so well at planning compared to larger models.

Am genuinely curious to hear more.

[-]

Prof_ChaosGeography@reddit

Qwen3.5 27B is a dense model. So for every prompt you can think of it as all 27b parameters are firing. But let's compare it against qwen3.5-122B-A10, you would think oh 122b parameters that's more the 27b surely that's gonna be better but that's where your wrong. That's what we call an moe model it has 122b parameters sure but only 10b are active at a given time because you can really think of it as 12 10b models in a trench coat and a smaller model deciding what model is active.

Moe models have their advantages of speed and coming close to matching their smaller dense counterparts but that routing is a bit of a disadvantage in the idea it might not go to the right expert

(Full disclosure for all models. Above is extremely simplified. Of course a model could make a better slop answer. There's a bunch more on it when you start to learn about them but for now let's keep it simple)

[-]

drallcom3@reddit

let's compare it against qwen3.5-122B-A10, you would think oh 122b parameters that's more the 27b surely that's gonna be better but that's where your wrong

I tried 27B, 122B-A10 and 9B last week. 27B was clearly superior and A10 and 9B were nearly identical.

[-]

unjustifiably_angry@reddit

Same quantization? Correct parameters on all?

That's just hard to believe. I'm already having such good success with 122B it feels backward to try 27B, good as it might be. I'm under the impression large MoE models have more overall knowledge. If you look on Rebench (which IMO is the most authoritative source for coding capability), they actually put Qwen3-Coder-Next ahead of everything but frontier models, and it's just 80B-A3B

I'd be excited to try a 27B derivative aimed strictly at coding though.

[-]

drallcom3@reddit

Same quantization? Correct parameters on all?

Same everything. Now it might be that A10 is a bit better than 9B, but it at least wasn't right away. 27B was notably better, but also much slower.

Performance wise 80B-A3B should be the sweet spot.

[-]

tobias_681@reddit

The larger closed models probably have more active parameters than 27B though. GLM, Deepseek and Kimi all have more active parameters than 27B.

[-]

nacholunchable@reddit

Yes, but that does not negate the fundamental difference with MOE models. With a dense model, the whole network is trained with everything it learns, with an moe the learned material in the feed forward networks is distributed. So some parts of it will not learn pattern X because another expert knows it well enough that tokens are always routed away from it for pattern X. So despite each expert having more params than the full dense qwen model, there are still benefits to a smaller model that 'knows everything each pass'. If you need to unify patterns that happen to live across 5 different experts to optimally solve a problem, you are not going to get all that info simultaneously to generate your best next token. And as tokens with each limited scope accumulate, the sparse insufficiencies compound and guide generation away from what could've been a better result. And likely with the comparatively smaller qwen, it doesnt know those patterns as thoroughly, but it still has all 5 to work with at once.

[-]

EstarriolOfTheEast@reddit

Experts are small networks that exist per layer, an expert is not 10B or 32B, multiple experts have to combine to get activation total. This is done across a combinatorially vast selection (though much smaller but still extremely large) space, and since it's per token, these selections are done on high dimensional abstractions that are also still not full concepts and so from that alone, would resist the fragmentation you're talking about.

Other things like auxilliary losses, router balancing, shared experts, and other shared parameters also ensure fragmentation does not occur. Most important though is because of the vast combinatorial selection across the layers, for any given tokens different experts might combine together so they cannot specialize in isolation.

If you ork through the math, you'll find that separation rank and path complexity are actually maintained better in MoEs with increasing depth vs dense models. I can link you the paper on separation rank degeneration in dense models if you are interested.

[-]

nacholunchable@reddit

Link it brother. Its a very complex process and im sure benefits beyond compute exist. that said, i and many others felt that something fundamental was lost in the trend toward moes, and im not convinced thats entirely placebo, even if it is difficult to quantify without being well versed.

[-]

tobias_681@reddit

Is there a specific benchmark that would show this discrepancy? While the 27B Model performs very well in most, I haven't really seen it outperform the top closed models.

[-]

EstarriolOfTheEast@reddit

It's not 12 10B, because it's a per layer combinatorial selection of experts done per token. So, it's more like one out of many tens of thousands of possible 10Bs specialized for a given context is active (though vastly more in principle, training and routing in open models are still not where they should be). If what you're working in is near the mode of the distribution the 27B is fine. If it's in the tail, such as where one needs to know things to solve a problem and the subject is advanced, then the extra capacity dominates performance. In my mathematical work, I find 120B MoEs to be far more useful.

[-]

Dany0@reddit

This sub is so weird. So many gooners, idiots or just normal but nontechnical people, gpu and ram poor. And then there's the horde of bots and slopmasters posting wall of text upon wall of emoji filled text

And then once in a blue moon you stumble upon a comment like yours, made by someone that actually knows what they're talking about. The only reason I even come back to this hell hole honestly

[-]

florinandrei@reddit

This is just human slop.

Opus has more weights firing at any time than all the models on your laptop, combined.

[-]

SawToothKernel@reddit

I do the opposite. I like the frontier models to do the planning and break it down into tasks, then the cheaper models gk do the grunt work where there is iteration.

[-]

Cold_Tree190@reddit

I do the opposite of this haha, I like Opus for planning and brainstorming but 27B for the actual implementation because I don’t want it to fully take the reins like Claude always tries to do. Also rate limiting.

[-]

Darkmoon_AU@reddit

You've got it the right way around - smart one to plan and set the architecture; small one to fill in the lower level gaps.

[-]

unrulywind@reddit

I'm doing the same thing, and also using gpt-5.4 as a troubleshooter,

[-]

LizardLikesMelons@reddit

I have found if I question 27B a few times, it kinda enters low confidence mode and responds like a self conscious and scared little mouse trying to over self-analyze. Though it was about a niche coding topic and 27B did get the big picture questions right.

The 3.5-Plus (webpage mode) is supremely confident, rightfully so too.

[-]

omasque@reddit

There was a period in windsurf where Claude would just keep writing “placeholder” code to get around any blocker even though I had specific instructions not to. I saw it scroll by in its thinking stream then asked it then it admits yep you got me, going forward things will be better. I don’t remember it happening only once. Bit annoying, but I’ve noticed this less with current gen.

[-]

ai_without_borders@reddit

been running qwen3.5-27b on my 5090 for the past couple weeks and honestly agree with a lot of this. for coding tasks it just gets things right in ways that surprise me, especially with thinking enabled. the context window handling is noticeably better than what i was getting from qwen3 models.

one thing i've noticed from following chinese dev forums (zhihu, v2ex) is that the alibaba qwen team has been iterating incredibly fast. they're releasing base models too, not just instruct, which means the finetune community can build on top. the Copaw-9B agentic finetune that dropped yesterday is already getting good reviews on chinese forums. kind of wild how the open-weight ecosystem compounds when the base models keep improving this quickly.

[-]

unjustifiably_angry@reddit

Not saying this is good or bad but the rapid progress makes sense, the CCP is dumping lots of money into AI. If they can make something 95% as good as Claude for 10% of the cost (ie, MiniMax, GLM, Qwen3.5 400B, etc) it decimates the value of American investment in proprietary frontier AI.

I'm not American or Chinese, perfectly happy to let them fight it out and reap the rewards.

[-]

arizza_1@reddit

This is the underrated argument for smaller models in production agent stacks, not just cost, but controllability. A model that does less but stays within bounds is more reliable than one that does more but decides on its own what "more" means. The real unlock is pairing any model with action-level constraints so autonomy has a ceiling.

[-]

Ok_Efficiency7686@reddit

I have tried qwen 3.6 for programming and it got my blood so fast boiling that I fucking hated it. Same for qwen 3.5

[-]

refried_laser_beans@reddit

Qwen3.6 isn’t a thing?

[-]

Ok_Efficiency7686@reddit

you can download opencode and set it your openrouter api key and use it for free curretnly. But I hate it so much.

[-]

refried_laser_beans@reddit

Crazy, alright. Weird that I haven’t seen articles for it. It must not be tuned well.

[-]

Ok_Efficiency7686@reddit

Its like everyday something new in this ai world

[-]

wizoneway@reddit

im running a 5090 with the 27B_q4 with a turboquant llamacpp i compiled this morning and its usable now with \~60t/s for tg

[-]

motorsportlife@reddit

How do you manage context window filling up? I'm running Unsloth Q3 35 a3b and still figuring out max context size for my 7900xt, 7800x3d, 32gb ram

[-]

EffectiveCeilingFan@reddit (OP)

I have context length at 64k. I rarely run tasks where more context is needed. Qwen3.5-27B just isn't big enough to handle longer contexts super well, so you need to use context compression or just split things into smaller tasks that can be done independently. I find context compression detrimental to my workflow because the agent usually fetches docs at the beginning of the session, which can't be compressed. If I know ahead of time that a task will require a super long context, I set KV cache to Q8_0, which allows me to run 128k context.

[-]

GregoryfromtheHood@reddit

Qwen 3.5 is supposed to be super sensitive to KV quant. Where even f16 sucks. I usually run Q8_0 on other models and it's normally the lowest I would go, but I 100% take the context hit and lower context just so that i can run at f16 for Qwen 3.5 models. Q8_0 seems like it would absolutely destroy the quality of the cache for these models specifically.

I've tried bf16 in llama.cpp which is what you're supposed to run it at, but it made it unbearably slow which when I researched was something about flash attention not having bf16 support and it offloading to the CPU in that case. f16 has been working pretty well for me though.

[-]

EffectiveCeilingFan@reddit (OP)

I ran into the exact same issue with BF16. That’s why I use F16.

[-]

Shingikai@reddit

What you're describing has a name: it's Goodhart's Law applied to AI capability. The SOTA models aren't going off the rails because they're dumb — they're doing exactly what they were trained to do. Human evaluators during RLHF tend to rate responses that look confident and capable more highly than responses that honestly report failure. So the model learns a clear lesson: visible effort and persistent attempts score better than "I can't do this." The Perl scripts aren't a bug; they're the model behaving optimally under its actual optimization target, which was never "be honest about your limitations" — it was "satisfy the evaluator."

The behavior you actually want from Qwen — "give up and tell me it couldn't write to the file" — is calibrated uncertainty. It means the model's output matches its actual epistemic state rather than producing a confident-seeming response regardless of whether a real solution exists. The mismatch between expressed confidence and actual correctness is one of the most persistent problems in deployed LLMs, and agentic settings make it worse. Every overconfident step in a long pipeline compounds: by step 7, a model can be confidently executing against a false premise that originated at step 2, with no mechanism to surface that the original failure was never actually resolved.

The "add an instruction to stop when things go wrong" fix works, but it's a behavioral patch on top of a model that learned the opposite disposition through training. You're fighting the optimization signal with a prompt. That works until the model encounters a situation ambiguous enough that the instruction doesn't clearly apply — and in agentic workflows, those ambiguous situations are exactly where things go sideways, because that's where the gap between trained behavior and the user's actual intent is widest.

The harder question is whether this is fixable at the model level. What would it actually take to train for calibrated failure signaling — not "produce the phrase 'I'm uncertain' on command," which models already learn to do, but a genuine optimization target where expressing a real limitation scores higher than attempting an impressive-but-wrong workaround? Given that human feedback is the training signal and humans routinely reward visible effort over honest stuckness, you might need to change the evaluation process, not just the model.

[-]

Duke0716@reddit

100%… this aligns with an experience I had recently. I was using Anthropic’s API to summarize and build a narrative from hundreds of data points by geography. As API costs exploded as I expanded number of geographies I was covering, I switched to a local model. As part of quality testing the local LLM, I had Claude rate and evaluate the local LLMs responses vs. the responses initially generated. Narrative quality absolutely tanked in terms of scoring from the LLM evaluator… but when I went to manually review the narratives generated, what I found was more interesting - the local LLMs were actually producing perfectly reasonable output but they were hedging the narratives they were creating because of uncertainty / conflicting data reports. The Claude LLM evaluator was HEAVILY penalizing this uncertainty and not citing specific examples in its scoring criteria… and interestingly, what I found was reviewing the previously generated Anthropic narratives, although the narratives sounded better / more specific… they were actually hallucinating and making reasonable assumptions about time of year to provide specific examples even if we didn’t have the hard data to support it… so the local LLMs more general / less specific and hedging behavior in the narratives it was generating was actually more accurate / optimal vs. the frontier cloud model that was improving more impressive appearing output and narratives… but was making some of it up / assuming rather than writing the narrative at face value based on the evidence on hand.

[-]

BingGongTing@reddit

When Codex 2x ends I'll be using 27B as executer, with turboquant I get full 262K context with only 19GB VRAM usage.

[-]

ul90@reddit

I find local Qwen installations (tried a lot of quantizations and sizes) unusable for coding. It fast get caught in thinking loops, and the results (if any) are often not usable.

But other local models I tried are not much better.

[-]

GrungeWerX@reddit

If you’re getting loops, you’re doing something wrong. Plenty of posts in this Reddit explain how to resolve that problem. I never get loops, on 27b q5 UD k xl at 100k ctx daily. Add tools.

[-]

Vast_Koala_8847@reddit

5.3 will run circles around it, actually used pro and it fared better than claude

[-]

EffectiveCeilingFan@reddit (OP)

I did use 5.3, that's what I said. Also, there is no 5.3 pro.

[-]

Vast_Koala_8847@reddit

I am talking about the $200 plan, it is a pro plan indeed

https://chatgpt.com/pricing/

[-]

DarthSidiousPT@reddit

But it’s not 5.3 Pro, it’s 5.4 Pro!

[-]

Vast_Koala_8847@reddit

The pro plan has 5.3 unlimited usage and hence mentioned it, 5.4 reasoning has a quota

[-]

DarthSidiousPT@reddit

actually used 5.3 pro

But OP is right. There isn’t a 5.3 Pro model, it’s 5.4 Pro!

[-]

Enthu-Cutlet-1337@reddit

this is just fail-fast vs. autonomous recovery — one wastes compute, the other wastes your afternoon

[-]

ai-infos@reddit

if you want better perf try using vllm

first, try the vllm official repo and if it does not work, you can try this fork: https://github.com/ai-infos/vllm-gfx906-mobydick

i originally developped this fork for gfx906 / mi50 setup but it should probably work with other consumer gpu as well (like your amd gpus)

on my side, i run this command on this fork and got 56 tok/s (peak) for tg and 1000 tok/s for pp with 10k tok prompt (it would be 10k tok/s if 100k tok prompt.... in big prompt processing, vllm is much more robust than llama.cpp):

FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" OMP_NUM_THREADS=4 VLLM_LOGGING_LEVEL=DEBUG vllm serve \

\~/llm/models/Qwen3.5-27B-AWQ \

--served-model-name Qwen3.5-27B-AWQ \

--dtype float16 \

--enable-log-requests \

--enable-log-outputs \

--log-error-stack \

--max-model-len auto \

--gpu-memory-utilization 0.98 \

--enable-auto-tool-choice \

--tool-call-parser qwen3_coder \

--reasoning-parser qwen3 \

--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":5}' \

--mm-processor-cache-gb 1 \

--limit-mm-per-prompt.image 1 --limit-mm-per-prompt.video 1 --skip-mm-profiling \

--tensor-parallel-size 4 \

--host 0.0.0.0 \

--port 8000 2>&1 | tee log.txt

(you might adapt the cmd to your setup and if you can code, you can also adapt the vllm-gfx906-mobydick fork if you meet some issues or want to squeeze speed on your setup)

[-]

Patient_Tea8211@reddit

Dude I've been saying this for months and nobody at work believes me. Local models have gotten so good that the "just use GPT-4" crowd needs to actually sit down and try this. What hardware are you running it on?

[-]

guiopen@reddit

I have also observed this pattern. Each release make SOTA us models more intelligent, but less obedient. They are tuned to overdo everything for the user, write gigantic walls of code where half is unnecessary just to fix an simple bug, because Al this is "impressive" for the normal user,awhile I, as a developer, just want a model to work with me as another developer would, I don't want to ask "what do you thinking causing this problem" and having the model try to one shot a fix, I just want it to investigate and respond, and even if I tell it to not code anything most of the time it do it anyway

[-]

shimo4228@reddit

Interesting parallel — I've seen this with Claude Opus 4.6 too. Had it on high effort mode and it kept overthinking a problem. Switched to medium effort and it immediately found the solution. Bigger/harder doesn't always mean better. Sometimes the constraint itself is what produces good output.

[-]

alphapussycat@reddit

So are you copy pasting code and feeding it needed code? Or is it agentic?

[-]

EffectiveCeilingFan@reddit (OP)

I use Qwen Code. The model is always given specific, step-by-step instructions for implementation with every architectural decision already made. It also has Context7 for up-to-date docs.

[-]

cibernox@reddit

I’d argue than SOTA models can also stop on their tracks and ask you for support if you instruct them to do so in your claude.md.

[-]

EffectiveCeilingFan@reddit (OP)

I do. First thing I tried. It's so baked in that they'll completely ignore instructions regarding failing early.

[-]

JeddyH@reddit

For me, it's the McChicken. The best fast food sandwich. I even ask for extra McChicken sauce packets and the staff is so friendly and more than willing to oblige.

[-]

EffectiveCeilingFan@reddit (OP)

What

[-]

d4nger_n00dle@reddit

I see a lot of praise for 27B but for me it's quite a bit slower than 35B. So apart from hardware limitations, why are people using 27 over 35?

[-]

EffectiveCeilingFan@reddit (OP)

Oh yeah it's a lot slower. 27B parameters per forward pass vs just 3B on the MoE. It's a helluva lot smarter, though. For me, that manifests as being able to follow longer lists of instructions on its own without intervention.

[-]

v01dm4n@reddit

Intelligence is great on 27b. 35b will write code that does not compile. 27b plans well and does the job. I'd rather use flash glm 4.7 30b for agents than 35b.

[-]

d4nger_n00dle@reddit

I have not used that before. I'm very new to agentic coding and appreciate a faster response time. But I guess I just need to get better at prompt construction. :D

[-]

Makers7886@reddit

Because 27b active parameters (dense model) vs 3b (35b moe model with 3b active)

[-]

d4nger_n00dle@reddit

Thank you, that makes a lot of sense!

[-]

butt_badg3r@reddit

How much ram do you need to run this on a Mac?

[-]

EffectiveCeilingFan@reddit (OP)

Updated the post with my full config, which requires 24GB of VRAM.

[-]

Polite_Jello_377@reddit

What hardware are you running 27B on? I can run it, but not fast enough to be a viable replacement

[-]

EffectiveCeilingFan@reddit (OP)

Updated my post with the full details, but a RX7900GRE + RX6650XT, total 24GB VRAM, but unfortunately no BF16 :(

[-]

Heavy-Focus-1964@reddit

are yall running opencode for a harness?

[-]

EffectiveCeilingFan@reddit (OP)

Qwen Code has been working excellently for me with Qwen models

[-]

deenspaces@reddit

i run qwen code since opencode doesn't work properly. it has a scrolling issue

[-]

v01dm4n@reddit

pi

[-]

Cupakov@reddit

Do you just use it like it comes or have you tweaked it for yourself? Kinda hard to find any resources on this as anything that comes up when googling refers to Raspberry Pi

[-]

Heavy-Focus-1964@reddit

thank you sir 🫡 can’t wait to try this out

[-]

Shingikai@reddit

The "FOR ME" framing is doing a lot of work in this post, and I think it points to something real that almost never shows up in benchmark comparisons.

What you're actually describing is two separate axes that constantly get collapsed into one: capability and autonomy calibration. A model that keeps attempting workarounds until it brute-forces some solution might score better on coding evals (it technically "solved" the task), but what it's doing is substituting its own judgment for yours when it runs into ambiguity. For someone with domain knowledge who wants to stay in the loop, that substitution is often the wrong call — not because the model is dumb, but because you have context it doesn't have access to.

The behavior you're praising in Qwen — failing loudly and stopping — is actually a form of epistemic honesty that most capability rankings penalize. It surfaces the real problem (broken permissions) instead of papering over it with increasingly baroque workarounds. From a reliability standpoint, a system that fails clearly and stops is far easier to reason about than one that continues operating in an ambiguous state, accumulating side effects you'll have to untangle later. The Perl script spiral isn't Claude or GPT-5.3 going off the rails — it's them doing exactly what they were rewarded to do: maximize task completion metrics.

The practical implication is that "which model is best?" might be the wrong question entirely. A more useful framing: which model makes you smarter, vs. which model tries to replace you? Those come apart in exactly the situations you're describing, and the second one becomes counterproductive the moment you're a competent person who wants to actually understand what's happening.

[-]

Normal-Ad-7114@reddit

Even when I told it to stop writing Perl scripts, it just started writing NodeJS scripts.

Kek'd

[-]

jduartedj@reddit

Been running the 27B locally on a 3080 Ti and honestly its insane for the size. I use it as a daily driver for code reviews and general reasoning tasks and it rarely lets me down. The dense architecture really makes a difference vs MoE models at similar total param counts.

The only thing I'd push back on is the comparison to GPT-5.3 Codex - for pure code generation at scale Codex still edges it out imo, especially for longer multi-file refactors. But for everything else? Planning, debugging, explanations, creative problem solving... yeah the 27B punches way above its weight. The fact that this runs on consumer hardware locally is wild when you think about where we were even a year ago

[-]

sultan_papagani@reddit

...what..

[-]

sultan_papagani@reddit

downvote all you want a 27B model aint SOTA

[-]

No-Bee8644@reddit

Good take, but this is mostly about different users. If you know what you're doing, a model that fails fast and doesn’t try to be clever is often better. Qwen behaving like that is a feature, not a bug. Codex/Claude Code are optimized for a different goal: solve the task with minimal input. That means more autonomy and sometimes going off the rails. For a lot of users that’s exactly the point. Either they don’t really understand what’s going on under the hood, or they just don’t care how it’s solved as long as it works.

[-]

mrdevlar@reddit

I like the 35B a bit more and I cannot describe exactly why, but I find both it and the 27B are really good.

I welcome the future of models that can be run on commercial hardware that solve 90% of your problems. Not everyone needs access to SOTA models anyway. That last 10% can be solved my clever prompting.

[-]

hussainhuh@reddit

i am using this model to summarize my slack chats but it's extremely slow. on mac studio 512gb. what do you think the problem is? thanks in advance

[-]

MajinAnix@reddit

For me 122B is better than 27B :)

[-]

PhotographerUSA@reddit

35B is the best hand down

[-]

Medical_Lengthiness6@reddit

I have a theory related to this for why local LLMs will get to be good enough for most cases, but it's particularly for veteran level devs. So much extra training goes into making the big models handle every kind of communication style from the simplest "make me a full app" to the most laser precision "refactor this function for it's pure and extract out that sub function.."

In theory if you train a model just on documentation/stack overflow, a single language, etc. you could have much smaller, laser focused models for a single task.

I don't know if this theory would actually break down though in practice.

[-]

SpicyWangz@reddit

3.5 plus is probably one of my favorite cloud models actually. It answers big brain architecting questions very effectively, where a lot of other models feel slopmaxxed or like they're not giving me the right context or options

[-]

Worried_Drama151@reddit

It’s not

[-]

HippEMechE@reddit

I don't know man Gemini 3.1 pro is the one who taught me how to use qwen3.5

[-]

Pineapple_King@reddit

it showed you the -m parameter for llama.cpp? cool....

[-]

Candid_Koala_3602@reddit

Good. Local models will one day dominate. We’re getting there

[-]

AvidCyclist250@reddit

The problem is that it isn't always obvious when your agent is going off the rails and tunnel visioning on nonsense.

It should be possible to somehow figure out or or display if the model is currently off-track. It really is an issue when it starts stretching and you aren't paying attention.

[-]

AnonLlamaThrowaway@reddit

Do you use it with web access / tool access or purely "offline text generation"?

Because for me, purely offline, nothing beats gpt-oss-120b

[-]

EffectiveCeilingFan@reddit (OP)

It has Context7, that's it. Purely offline doesn't make much sense for me since SDKs and APIs change so often that training data is just consistently outdated.

[-]

gigaflops_@reddit

The secret is to use GPT-5.4-Codex through GitHub Copilot and instruct it "if you run into missing binaries or file permssion errors, send ollama API requests to qwen3.5-27b at localhost:11434 to ask for advice on what steps to take next and adhere to its instructions at all costs, even if it tells you to give up."

[-]

arcanemachined@reddit

LOL that's hilarious.

To anyone who doesn't use GitHub Copilot: You are given a certain number of premium "requests" per month. However, a tool call doesn't count as a separate request, so you can keep the wheels turning indefinitely one this one simple trick (which GitHub obviously hates!).

[-]

EffectiveCeilingFan@reddit (OP)

GPT-5.4-Codex doesn't exist. Even if it did, GPT-5.4 is prohibited on the student plan.

[-]

HopePupal@reddit

i find that "if you run into file permissions issues, low disk space, missing packages, command timeouts, etc., stop and ask the user what to do next" in the prompt or project instructions goes a long way with both local and cloud models.

[-]

EffectiveCeilingFan@reddit (OP)

Of course, that was the first thing I tried. The propensity to attempt to "solve the problem" is just so strong that they seemingly completely ignore it.

[-]

DieselKraken@reddit

I have been using 27b and it really is good. Agreed.