[-]

kevin_1994@reddit

27b at q8 is frankly unbelievable. It stays coherent well after 100k+ context. It has essentially no knowledge but I hook it up to the internet and tbh it feels frontier level for me most of the time. Just don't let it tell you anything without looking it up lmao.

For coding, I honestly can't believe I'm able to run this on only 48GB of VRAM. It feels like all I'd ever need. I'm a software developer and don't really do "vibecoding", but its been excellent at helping me debug issues. The other day it helped me debug a weird issue in our OAuth server's PKCE implantation by executing curl commands on OpenTerminal, researching OAuth and PKCE RFCs, and writing various test applications in node until it could replicate the issue. Well over 100k context. Various impresssive imo

The 35b in comparison (at native bf16) is fast but much sloppier, misses a lot of nuance, and falls into traps much more frequently that it can't recover from. Its still very strong though. I use it quite often when I have a banal task that I want quickly.

[-]

PWCIV@reddit

how do you hook it up to the internet? im using brave api mcp

[-]

kevin_1994@reddit

I use Open-WebUI + SearXNG, with following methods to improve results:

I wrote a skill that just tells the model to use the fetch_url and not rely on snippets
I wrote another skill called Reddit_Sleuth (lol) which instructs the model to fetch reddit content using OpenTerminal using a bunch of various cli tools which bypass rate limit protections
Use native tool calling, Open-WebUI's "default" web search feature sucks. It's way more powerful to let the model just fetch what it needs

[-]

amalcev@reddit

Could you share these skills please?

[-]

kevin_1994@reddit

Sure. They are mostly slop but seem to work OK:

https://pastebin.com/0uP5MFRX

https://pastebin.com/26BugbUc

[-]

sonemonu@reddit

Can you please tell me the command you use to run the model? I have had memory issues on my M4 max 64gb.

[-]

kevin_1994@reddit

taskset -c 0-15 /home/kevin/ai/llama.cpp/build/bin/llama-server -m /home/kevin/ai/models/Qwen3.6-27B-Q8_0.gguf --mmproj /home/kevin/ai/models/mmproj-qwen3.6-27b/mmproj-F16.gguf -ctv q8_0 -ctk q8_0 -c 156000 -ngl 999 -np 1 -t 16 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 -b 4096 -ub 4096 --chat-template-kwargs '{"preserve_thinking": true}' --port 1231 -a model

I use taskset and -t 16 to only use pcores on Intel i7 13700k. Gives like 5% boost
use -np 1 so that kv cache isn't allocated twice, just once, since I'm rarely processing multiple queries at the same time
attention at q8 since it gives more context and I don't notice any difference between q8 with attention rotation and bf16 for this model specifically
batch and microbatch at 4096 gives best pp with 3090/4090 in my experience
disabling mmap gives slight perf increase but not worth it imo because model swapping way slower
using latest llama.cpp (as of yesterday) and ubuntu 26.04 with nvidia drivers 595 and cuda 13.2
runs around 30 tok/s and 1500 pp/s

[-]

mr_Owner@reddit

Did you ever look and try qwen3 coder next?

[-]

kevin_1994@reddit

Yes. I like it a lot. My model progression from 2024 has gone something like: Llama 3 -> QwQ -> Qwen3 32B -> GPT OSS 120B -> Qwen3 Coder Next -> Qwen3.6 27B

I found coder next really powerful, but only at Q8, otherwise it would misinterpret tasks in a subtle way. I also found this model would collapse around 100-120k tokens completely and end up messing up tool calls.

Basically coder next is WAY faster at output, a little bit slower for my machine on input (have to offload about half the model), not as powerful (but still powerful), and seems to be singularly optimized for web development tasks (which is 90% of what I do, but that 10% though)

[-]

dannydeetran@reddit

I agree with you, 27b q8 with full kv cache is amazing.

Sure the speeds are slow but quality output is actually impressive.

[-]

IrisColt@reddit

but I hook it up to the internet

Teach me senpai... What do you use for that?

[-]

Snoo_27681@reddit (OP)

Thanks for your perspective. This is kinda what I thought would happen: 27B would be more reliable than 35B. But in a multi-step workflow I'm not finding this to be the case. So most of my queries are less than 20k tokens total, with some up to 80k tokens.

How are you using these models? In an opencode/claude code type chat interface? Do you try to reset chats frequently or do you meander in conversation like you can do with Opus?

[-]

audioen@reddit

You should not draw any far-reaching conclusions when you use nvfp4 quants, of either model. Try to run the official fp8 versions at least for the 27b -- I don't care about the 35b personally, anymore -- because these models are much worse at 4 bits.

[-]

kevin_1994@reddit

I haven't used closed models for a long time, except when work forces me to. So I can't say for sure how it compares to Claude.

I use them via Open-Webui and OpenCode. OpenCode is pretty straightforward. For Open-Webui I enable native tool calling, web search, OpenTerminal, and have skills to tell the agent how to use the tools effectively. Example: I have a skill called Researcher which tells the model to cross reference facts, not rely on SearXNG snippets (fetch the URLs and read them), and how to bypass web scraping protections lol.

Generally speaking I don't meander though. I use AIs almost exclusively as a tool. Usually I have one specific thing I want.

[-]

CircularSeasoning@reddit

27b at q8 is frankly unbelievable.

Have you asked it for a cupcake recipe and how were they?

[-]

ayylmaonade@reddit

You sure you're not just using them in a workflow where they're both pretty much equally competent (e.g., something simple), and therefore aren't noticing the difference? The speed of the 35B might also be giving a bit of a placebo. I've fallen into that trap before.

[-]

Snoo_27681@reddit (OP)

I suspect that is what's going on: I'm intentionally trying to make problems small so either model may be good enough in my tests. As opposed to harder tests or multiple idea chats. I try to stay to one prompt per chat.

[-]

coder543@reddit

The 27B is used 9x as many parameters to calculate each token, and the benchmarks reflect that increased intelligence. I can't imagine how you're experiencing the 35B to be smarter. It is much faster. It is not smarter in my experience, or in the experiences of the many people you're referring to.

[-]

Dany0@reddit

it has more breadth of knowledge that's for sure. Smarter? Nooo

I did enjoy using it over Qwen3.5 27b though. In that limited time before Q3.6 27b came out. It showed me that I really do have tasks where speed trumps intelligence more often than I expected. If only I could have both models loaded in vram at the same time

If OP has a niche use case I can imagine they could be running into limits of 27B. Would be nice if they could disclose their use case, specific prompts where they thought the moe did better

[-]

Last_Mastod0n@reddit

In large pipelines there is an unload feature in LM studio to where you can have it load 35B for example for one part (for speed) then unload and load up 27B (for performance). So you get the best of both worlds

[-]

drycounty@reddit

Could you point me somewhere for this? Haven’t done any googling as I’m on phone but that sounds amazing

[-]

Arkenstonish@reddit

Not studio, but pure llama CPP has models/load /unload api, that could not be more simple.

[-]

Sea-Attention-5815@reddit

Lm studio have eject model

[-]

Cladstriff@reddit

I use LLM swap, it works great https://github.com/mostlygeek/llama-swap

[-]

Firemustard@reddit

https://omlx.ai/ it's optimised for Mac and it's doing it automatically and easy. Also you got around 20% more performance vs any other solution if you are using mlx model that is optimised for Apple silicon

[-]

Last_Mastod0n@reddit

https://lmstudio.ai/docs/developer/rest/unload

Here you go!

[-]

AntiCamPr@reddit

Simply add both models to the server in LM Studio and make sure your agent has both model names in its server/provider configuration. That's it. LM Studio will automatically load and switch models when your agent switches.

[-]

GrungeWerX@reddit

35B was no comparison to 27B 3.5 in my tests. Curious what you were doing where it was better. 35B is not even close, especially over long context.

[-]

CircularSeasoning@reddit

\> It showed me that I really do have tasks where speed trumps intelligence more often than I expected. If only I could have both models loaded in vram at the same time

Imagine what it's like to Elon Musk. Instant bytecode perfectly aligned in highest definition to every desire.

[-]

vevi33@reddit

Yeah 9x the active parameters but less total parameters. While dense models in general better (27B is indeed more smart, the difference might be 0-15% depending on the task. Not 9x smarter. Important to note imo.

Also people with 16gb VRAM and enough ram can run much higher quant from the 35B so kinda equals out, especially if you plan to use quantanized KV cache on the 27B Q4 model.

[-]

roosterfareye@reddit

I have found the exact same. I switch between them all the time for that very reason

[-]

bnightstars@reddit

I wonder if they release 3.6-9B how good it will be as it will be just between the two more active parameters than 35B but less than 27B so in essence will be smarter than 35B and faster than 27B.

[-]

vevi33@reddit

Definitely not. 9B would not be better than the 35B MoE. But a 14-18B would be competitive in speed and performance as well.

[-]

Snoo_27681@reddit (OP)

I think that I'm starting to put more work into the pipeline and break down problems more. So I need the model to be less smart in general. For example, I do a lot of firmware work and I have the bot do a internet search for drivers and that removes the need for a lot of coding smarts.

But I run pipelines in parallel with Claude Code (using `claude -p`) with Opus 4.6 1M, Sonnet 4.6 1M, Qwen3.6-35B-A3B-nvfp4, Qwen3.6-27B-nfvp4. And I'll have 35B match Sonnet and beat 27B, neither matches Opus usually.

[-]

our_sole@reddit

Can you tell me more about running CC against qwen3.6-35b-A3B? Are you using ollama/lmstudio/llama.cpp?

I am having no luck at all using llama.cpp with that llm and unsloth UD quantization with CC. CC just immediately throws an error msg saying it can't use the llm.

[-]

the_fabled_bard@reddit

LM studio feeds qwen3.6-35b-A3B to CC inside Cursor just fine!

[-]

Snoo_27681@reddit (OP)

MLX only for running the models. Beyond that I have claude search for solutions on how to run them

[-]

jedsk@reddit

I've been testing those two in a project but eventually ended up using3.5 122b A10B Q4.

[-]

Stitch10925@reddit

What hardware do you guys have to be able to run those models? I have a 2000 Euro system and only have 16GB VRAM available. I clearly did something wrong

[-]

my_name_isnt_clever@reddit

I run ~120b models on my Strix Halo with 128GB. It was $2500 (just before the RAMpocalypse).

The tradeoff for so much VRAM is pretty bad memory bandwidth, so it kills with MoE models but struggles with dense.

[-]

Stitch10925@reddit

Thanks for the insights, appreciate it

[-]

ohhi23021@reddit

yes, you didn't spend enough money.

[-]

Stitch10925@reddit

Well you know, you do what you can

[-]

rpkarma@reddit

I’m curious about why? I have also started leaning back towards 122b but I’m curious about your thoughts and comparison

[-]

jedsk@reddit

I'm doing batched scoring on vLLM. The 122B was \~3x faster than 27B for me as its MOE, plus caught false positives 35B was over-promoting.

[-]

Hot-Employ-3399@reddit

35B is significantly worse on my tests. Liie my use case is "GTFO I don't want to talk to you" so give quite big prompt and expect lots of working tests and then I go playing Minecraft for more than a hour. 35B model can take 1.5 longer to finish. Sure it's faster with tokens but tokens are worse

I didn't test too much(would take days) but where I test it I never was impressed.

[-]

lit1337@reddit

so i do quant work on both of these models and yeah this makes sense. the 27b dense uses all 27 billion parameters on every single token. the 35b moe only fires like 3b params per token, it just picks from a bigger pool of knowledge. so 27b is literally thinking 9x harder per step, its just slower doing it. the quant thing matters too. at nvfp4 youre compressing both pretty hard, but the 35b moe has tons of redundancy (256 experts, only 8 active at a time) so it handles compression way better. the 27b dense has no slack, every parameter is load bearing, so q4 hurts it more. your pipeline setup is basically compensating for the 35b being dumber per step by giving it more steps and more tool calls. thats legit, speed matters in iteration heavy workflows. but if you throw a genuinely hard single shot problem at both (complex refactor, tricky logic bug, something that needs deep
reasoning) the 27b will smoke it every time at the same quant level. ive been working on mixed precision quants that get the 27b down to like 10gb without the usual quality cliff. you figure out which weight groups can survive 2 bit and which ones need to stay at 3-4 bit. not everything in the model is equally important.

[-]

Snoo_27681@reddit (OP)

I'm striving to not throw hard problems at the models haha. I think that's what gives me good results with local models is I have a lot more care about prompts and context and settings.

How do you figure out which weight groups you can more heavily quantize? Any papers or links you recomment?

[-]

lit1337@reddit

Measure it. Run each tensor group at lower precision and check if perplexity tanks. If it doesn't move, that group can be crushed harder. If it jumps, leave it alone or promote it. I do reverse ablation start with everything at Q2_K, then promote one group at a time to Q3_K and measure PPL. Whichever group gives the biggest improvement per bit of extra space = that's your most sensitive tensor. Promote those, demote the ones that don't matter, budget balances out.

Patterns I've seen across multiple architectures: Attention K/V projections are usually sensitive, FFN gate weights can take a beating, Q2_K is usually fine, Norms (f32) you never touch, SSM params in hybrid models break below 4-bit, instant NaN. MoE expert weights are fragile but shared experts are robust. No single paper ive seen covers this exactly. As far as i know, SqueezeLLM and AWQ are in the same neighborhood but they use Hessian approximations instead of just measuring it directly. GPTQ does per-channel sensitivity but not at the tensor-group level. The llama.cpp "importance matrix" is a different signal. Honestly its alot of trial an error, and grunt work. But when youre limited on resources you squeeze water from a stone when you can lol.

[-]

alex20_202020@reddit

That looks like a cool tool, thanks!

I wonder why you have included inference speed for one model only: "Speed: 71 tok/s prompt, 36.5 tok/s generation (RTX 3090, full offload)"

More importantly, fist line is python -m osmosis.cerebellum ablate \ --base-gguf model-Q2_K.gguf \ --tensors ablation_plan.json \ --output ablation_results.json

And ablation_plan.json is not used anymore - hence it is input. But you explain as everything on auto. Where am I supposed to get the file (it is not in the repo)?

[-]

lit1337@reddit

yeah that section was written for a CLI i planned to build but never did. the manual process worked so i never got around to automating it. fixed the readme now. if you wanna do this yourself: grab a Q2_K base GGUF, then use streaming_quantize.py to promote one tensor group at a time to Q3_K. measure PPL after each one with llama-perplexity. whichever groups drop PPL the most are worth the extra bits. combine your winners into one override file that fits your size target and build the final GGUF. streaming quantizer handles any model size with like 300MB RAM. its tedious but thats where the gains come from. thanks for checking it out!

[-]

alex20_202020@reddit

grab a Q2_K base GGUF

Again Q2 base. But it got information lost from f16 already, why do you start with it, not f16?

[-]

lit1337@reddit

not unquanting it back up or anything. the Q2_K file was made from the original f16 weights by llama-quantize. when i promote a tensor back to Q3_K its requantizing from whats stored in the GGUF already. theres information loss from the initial quant for sure. but what im measuring is which tensors care about being at Q2 and which ones dont. some actually perform better compressed because it acts like regularization. you wouldnt find that starting from f16. also f16 of a 122B model is like 244 GB so i physically cant work with that lol. starting from the smaller base means each test is fast and i can iterate quickly. same conclusions either way, just way more practical in my situation.

[-]

alex20_202020@reddit

but what im measuring is which tensors care about being at Q2 and which ones dont.

Interesting, I see also python -m osmosis.imatrix_stream \ --model Qwen/Qwen3.6-27B \ --output osmosis_imatrix.dat -v and my most recent thought is that it is what makes all your manipulation to add value. Seems you get imatrix from f16, then apply it during GGUF building, it improves "base" Q2 quantization. Am I on correct track?

[-]

lit1337@reddit

not exactly. imatrix improves the base quant by telling it which values within each tensor matter more during quantization. thats a separate thing from what cerebellum does, osmosis is what cerebellum started as. cerebellum decides which entire tensor groups get more or fewer bits based on measured sensitivity. they stack but theyre independent processes. the imatrix tool in the repo was just a fast way to generate importance data on cpu, its not really part of the main workflow. the core of cerebellum is just: promote one group at a time, measure ppl, keep what helps, crush what doesnt matter. even going so far as testing individual layers of the important tensors to cut fat out after the per tensor group ablation.

[-]

alex20_202020@reddit

not exactly. imatrix improves the base quant by telling it which values within each tensor matter more during quantization.

Thank you, I did not understand the role of imatrix well. But then we get back to base Q2. You just add more space for same data. If 1.12 gets stored as 1.120, I do not understand how it is better. Maybe in a random way, sometimes it just makes model better, LLM are random token generators to start with.

[-]

lit1337@reddit

oh wait i just realized what you're actually asking. you're wondering how starting from a Q2 file can end up smaller AND better. so the trick is, even in a Q2_K file not everything is actually Q2. the norms are F32, embeddings are Q6_K, output head is Q6_K or Q8_0. but beyond that, when i run ablation on the
actual Q2_K tensors themselves, some of them actively hurt the model at that precision. so i can cut those tensors even lower or restructure the bit allocation within the same budget. the file doesn't have to grow to get better, sometimes you just have to stop spending bits on tensors that are fighting you at that precision. also you can do this to really any quant, i started from q2 for 122b because of its size.

[-]

DarkEye1234@reddit

Nice job, subsribed to watch. Are you planning qwen 122b?

[-]

lit1337@reddit

already got ablation data and promotion overrides written for it. Its just i gotta run the gpu stuff during my sleep schedule or when i dont game, so it takes a bit. Oddly enough the 3.6 27B test results are actually higher, i found a flaw in my test that added indentation to some answers causing null answers and coding errors. So it actually tests quite a bit higher on a few tests. HumanEval before: 75.0%, after: 81.1%. So I am very curious what can be squeezed out of the 122B, especially because this seems to do the most work. The inactive experts can be crushed the most.

[-]

Hekel1989@reddit

Admittedly I'm a noobn when it comes to this, but, I've got a 4090 and I use Qwen3.6 27b q4 hosted on Ollama, using Opencode as the harness, and the results are shocking. Extremely slow, and after the first attempt, it keep looping on itself and never getting to the end of it. I run it on a meager 64k context, kv cache q8, and I can't get it to be good.

If anyone got some settings they'd like to share for similar hardware I'd really appreciate it, because I really can't get it to be useful.

[-]

Freigus@reddit

It's weird that you got "looping thinking" in an agentic harness - at least qwen3.5 switches to "short reasoning" if it has tools. (I haven't tried it with opencode though)

You can try setting presence_penalty to 0.5 (higher is more aggressive, you can set up to 2.0, but for coding it's recommended to keep it low.)

[-]

satyaloka93@reddit

There is some weird knowledge gap between the models with one particular question “What is an Imatrix quant?”. 35B gets it correct, associated with llama.cpp. The 27b suggests a misspelling, then tries to relate to either math or trading. I tried with vLLM and gguf quants, just 4 bits, however.

[-]

Fragrant_Scale6456@reddit

Interesting I tried it with 27b Q6 K and got a similar answer. I asked it why it didnt tell me about llama.cpp and it responded with:

Zero context in the initial prompt: "What is an Imatrix quant" is extremely brief. Without domain cues, "quant" is heavily overloaded (quantitative finance, AI model quantization, job titles, etc.), and "Imatrix" isn't a widely recognized standalone term outside a specific open-source niche.
I defaulted to safe speculation: When prompts are ambiguous, my training biases me toward listing possible interpretations and asking for clarification rather than guessing a single answer that could be wrong. I should have been more direct about the ambiguity instead of branching into unrelated fields.
The term is relatively recent & ecosystem-specific: Importance Matrix quantization (imatrix) was introduced to llama.cpp in early 2024. It's well-documented within that community but hasn't trickled into broader textbooks, mainstream AI courses, or general-purpose search indices as a standalone phrase. LLMs like me are less likely to surface hyper-niche, recently coined tooling terms without explicit context.
Context changed everything: Once you mentioned llama.cpp, the signal-to-noise ratio flipped. I could immediately map "imatrix" + "quant" to the exact feature, workflow, and community best practices.

[-]

satyaloka93@reddit

Yeah when I mentioned the association, it gave me better results. But 35b still responded overall more completely. Gemma4 26b gave better answer still.

[-]

IrisColt@reddit

This.

[-]

Fragrant_Scale6456@reddit

Interesting findings thanks for sharing. This is the first case I’ve seen where 27b wasn’t superior to the more models

[-]

MaxEkb77@reddit

answer my 35b, check please "An Imatrix (short for Importance Matrix) quant refers to a specific quantization method used for Large Language Models (LLMs), specifically within the GGUF format ecosystem (commonly associated with llama.cpp).

It is essentially an application of AWQ (Activation-aware Weight Quantization).

The Core Concept

Standard quantization (like Q4_K_M) treats all weight matrices in the model roughly equally—meaning every part of the model gets the same level of precision (bits). However, in deep learning models, not all weights are equal: some weights are critical for the model's "reasoning" and accuracy, while others are less important.

Imatrix quantization improves quality by preserving the critical weights and sacrificing the less important ones.

How it Works

Calibration: The model is run on a specific calibration dataset (usually a few thousand tokens).
Measurement: The system measures how sensitive the model's output is to small perturbations (changes) in the weight matrices.
Categorization:
High Importance: Weights that drastically affect the output if changed are marked as "important." These are kept at high precision (e.g., 8 bits).
Low Importance: Weights that don't change the output much are marked as "unimportant." These are squeezed into very low precision (e.g., 2 or 3 bits).
Reconstruction: The model is re-quantized using this matrix. The resulting file is labeled the same standard type (e.g., Q4_K_M), but it is "IQMatrix" enhanced.

Why Use It?

Better Accuracy: An Imatrix quantized model (e.g., imatrix-q4_k_m) often answers correctly more often and hallucinates less than a standard quantized model of the same nominal size.
Same Size: The resulting file size remains the same as the standard quantization.

Technical Context (llama.cpp)

If you see files named like model-q4_k_m.imatrix, this refers to the file used to guide the quantization process. You can generate this file using llama.cpp by running the model on a dataset, and then use it when converting a FP16 model to a quantized GGUF file."

[-]

audioen@reddit

It is unfortunately an incorrect reply and doesn't describe the method correctly.

[-]

LegacyRemaster@reddit

122b all I need

[-]

IrisColt@reddit

35B is able to mimick human intuition better, but is prone to hallucinations... 27B is the most knowledgeable, but is also less insighful

[-]

Last_Mastod0n@reddit

27B is definitely smarter all around.

I tested both extensively for my project. However the difference isnt huge. It depends on what task your doing but in my vision pipeline 27B had about 10% more observations than 35B at the same quant level.

So if you want 2x the token generation speed for around 10-20% less performance then 35B is Definitely worth it.

[-]

bgravato@reddit

2x? It's more like 4-5x! At least with Q4_K_M. I haven't tested with other Qs.

[-]

CatEatsDogs@reddit

For me its 4x. 26ts on 27b_q4 and 102ts on 35b_q4. Available context amount is also not in favor of 27b.

[-]

CircularSeasoning@reddit

> So if you want 2x the token generation speed for around 10-20% less performance

Is this the corporate definition of "extroverted"?

[-]

Educational-Fruit854@reddit

[-]

CircularSeasoning@reddit

Life is Q4_K_M, confirmed.

[-]

CircularSeasoning@reddit

*unless you write it out by hand on an infinite quant piece of paper, to make tha antis happy.

[-]

audioen@reddit

I can't get reliable output of the 35b. It is fast, but it doesn't understand enough, and at least in my case, letting it loose in the codebase results in its devolution over time. I tested it on Q8_K_XL to give it the best possible chance, and the best way I can put it is that I can't get quality code out of it even if I provide it with guidance, and I have to babysit it a lot more, and the thinking loops are also much more frequent than in 27b, which can go entire sessions without one, whereas it feels like 35b is in thinking loop half of my test sessions.

But the quality was not sufficient for me to care about it. Either it requires much more preliminary work, or simply isn't able to understand code at level which is needed to perform valuable intellectual labor, rather than just proceed to quickly create a confused mess that needs to be sorted out later.

[-]

Holiday_Purpose_3166@reddit

They're both on par depending how it's used.

35B has more knowledge depth but it's limited at 3B at a time.

If you execute a job that as a small scope of information, the 35B will do better and with fewer total tokens.

In my case, since it runs 3.2x faster than 27B, I can swarm 4x instances into modular jobs (better MoE routing) and loop PR reviews for faster fixes.

Not all work is the same. On my custom benchmarks, 35B-A3B comes out on top with better scoring where 27B trails closely.

Outside the benchmarks I've also noticed 35B-A3B was spending half the tokens for the same tasks with Hermes, and spends fewer tokens with thinking-on - it did more tool calls without and answer format was less polished.

I like both models, but one can do more work with 35B-A3B. If dense model was the ultimate choice, top labs would use it, but it's too heavy to run.

[-]

Steus_au@reddit

i have asked 35b to enbale vision capabilities in opencode with llama-server backend and even with access to tavily it can’t do it at all.

[-]

ComfyUser48@reddit

The only reason to prefer the 35b MoE is speed and that's it.

[-]

tecneeq@reddit

The results are better with 27b i think, but speed is the problem for me. So i stick with 35b-a3b at work and at home.

[-]

silverud@reddit

On my system (M3 Max 128gb), 35B-A3B is 50-55 tps at Q8_0.

27B is 10-11 tps at Q8_0.

27B is smarter and produces better output. If I had to quantify that, I'd say it is between 10-20% better. In some cases that is the difference between right and wrong - between wasting a pass or tool call and being useful. In other cases it is a marginal difference.

But we're talking about a 20% difference in quality compared against a 400% difference in speed.

I keep both loaded and try to direct critical workflow steps that NEED to be right to 27B and ones that can be done twice or have redundant verification and cleanup downstream to 35B-A3B.

If I could get 40-50 tps out of 27B, I'd delete 35B-A3B and never look back. As it stands, each have their place, and more work gets sent to 35B-A3B just because it is 80% as good and 500% as fast.

[-]

jacek2023@reddit

In photography we say that the best camera is the one you use most often.

The best model is the one you use more often, and in your case simply the one that is faster. If a model is too slow you can run it to ask "what is the capital of France?" or "how many strawberries are in R" but it won't be smarter than the faster model, because you will never be able to get smart answer from it, you will just turn it off

[-]

the_derby@reddit

In photography we say that the best camera is the one you use most often.

I thought it was “the best camera is the one you have with you”?

[-]

ReferenceOwn287@reddit

For a lot of tasks, the 35B might works just fine (and it's faster ofcourse), but have you tried out a more complex task. When I asked - "Build me one level of the classic dangerous dave game in an html page" - the 35B model had several bugs each time I tried but the 27B got it right away.

[-]

duirronir@reddit

I prefer 35B (oQ8) over 27B (both MLX) on my M1 Max 64 GB. as long as my prompts are clear and provides direct instructions, 35B is doing a good job. 27B might be better, but it's really slow. Maybe I should try with GGUF, not sure if it'd help tho.

[-]

FortiTree@reddit

I heard oMLX doesnt support warm KV cache? meaning it has 0 prefilled cache hit? if so, prefilled speed is terrible. GGUF with llamacpp is much better. But I dont have Mac to verify this.

[-]

dead_dads@reddit

Yo! New to local LLMs/ai stuff in general. I have an old 3090 and 128gb of DDR4 RAM. Was going to sell my old machine for parts but occurred to me this week I could turn it into an ai machine to dip my toes into locally run stuff.

My interest rn is to work on some vibe coding projects. Would like to assess and test models that fit fully into the VRAM of the 3090 but also curious about utilizing my ram (DDR4) to see what larger models can bring into the equation.

What models would be worth by time for testing? I’ve been working with Claude to ID some stuff of interest but as this field moves so fast I thought asking people who are actively engaged in this stuff would be better.

[-]

crantob@reddit

too many variables to answer with a single model. You will need to learn, old padawan.

[-]

Dizzy_Thought_397@reddit

Combining your RAM and VRAM, you have enough memory to run larger models. The problem is, using the system’s RAM usually causes a significant drop in performance: the token output can decrease by a factor of 8 to 10.

I recommend you start by testing the performance of a model like Qwen2.5-Coder 7B. It should run smoothly on your machine. Use its performance as a baseline, and then test versions with more parameters (Qwen3-Coder-Next would be great option, but it demands a lot).

[-]

Exotic-Tear593@reddit

Curious, what TPS speeds are you seeing with 35B on your Mac Studio?

[-]

pepedombo@reddit

I'm working on already structured code in qwen code and after time I've found myself fixing 35B Q8 frequently so I had no income from its speed. Then I switched to sluggish 27B Q8 and it felt rigid at planning and going straight to the point.

No benchmarks, just a daily feeling after spending time with them both so it depends. Gemma 4 is also able to one-shot something much better than qwen does, but later it fails to the point where you get back to qwen because it's more predictable or maybe it simply suits me more.

Now I'm running two instances in pararrel, 27b q5 for code and 35b q8 for docs/audits/plans/searching/easier tasks.

Checked 27B nvfp4 with few coding tasks against 27b q4-q8 and deleted it 😛

[-]

an80sPWNstar@reddit

Have you tried the mxfp4?

[-]

jonydevidson@reddit

Mac M5 Ultra doesn't exist.

[-]

Snoo_27681@reddit (OP)

thanks, m5 max. I only know to pay attention to ram

[-]

LocoMod@reddit

Doesn’t exactly inspire confidence in your ability to assess the capabilities of an LLM does it? Knowing the capabilities of your own hardware is kindergarten compared to assessing LLM capability.

[-]

LendUMoney@reddit

This isn’t a vetted professional cohort, we’re on the internet sharing our opinions and experiences. Typos and mistakes happen, bud.

[-]

CircularSeasoning@reddit

work mac M5 ~~ultra~~

So you know it's good.

[-]

natermer@reddit

27B is going to usually be much slower because it is a dense model. Were as the 35B is a "mixture of experts" and thus only a subset of the 35 billion parameters is active at any given time.

Because it is dense and all 27 billion parameters are active all the time the 27B model is supposed to be a bit smarter.

But as with everything... your mileage will vary.

[-]

Perfect-Flounder7856@reddit

You know I've had mixed results where 27b bf16 beats 35b but 35b q8 beats out 27b and 27b q4 beats 35b I haven't tried fp8 or nvfp4 yet. On policy reasoning benchmarks I made up ranging between 84-95

[-]

Elegant_Tech@reddit

122B when? Hopefully Qwen actually releases it.

[-]

No_Hunter_7786@reddit

27B is more popular probably because it fits on more hardware. If you have the VRAM for 35B and it performs better for you just stick with it, no reason to downgrade

[-]

etaoin314@reddit

I’ve had a similar experience. The 35b seems competent as hell. Everything just seems to work. with the 27b I’ve had all kinds of set up issues. Granted I’ve been trying all the fancy mods. The base model seems OK just a bit slow.

[-]

Snoo_27681@reddit (OP)

Yeah I keep thinking I'm not using the right parameters or system prompt. 27B dense has to be better than the 3B expert model chosen by the 35B network, right?

[-]

SLxTnT@reddit

Could simply be task. I like to do a basic test to have it convert Firedancer's base58 implementation to C# as that's the typical things I use AI for. There's 3 levels of difficulty on it. A reference implementation, optimized, and SIMD. Only Opus has successfully done the SIMD version so far.

35b fails completely. Couldn't even get a compiled version working after 15 minutes (150+ tokens/s).
27b can complete it with minimal compiler issues. Then another pass to fix failing tests. With some instructions, it can even find some more optimizations.

35b isn't bad. It's simply useless when it comes to my tasks compared to 27b.

[-]

simracerman@reddit

What’s your setup with this?

[-]

Snoo_27681@reddit (OP)

mac studio m4 max 128gb, m5 max 48gb

[-]

SirDomz@reddit

is 48gb not too limiting, coming from the M4 max 128

[-]

Snoo_27681@reddit (OP)

It's both very limiting in the fact that I can actually crash the machine if I (or Claude) forgets to launch the query with a token limit of <100k tokens. And if I'm running other engineering software/scripts the conversation is lesss

[-]

uti24@reddit

It's fair to raise this question.

Benchmarks aside, 35B more often got into loops for me, 35B also implemented features in a naive way, while 27B could figure out tricky parts by itself.

Although both fall apart after about 50k tokens

[-]

Hosereel@reddit

Same experience except in my case it falls off after about 60% of my context window of 256k

[-]

Snoo_27681@reddit (OP)

I use \~50% full context as intelligence limit as a guideline for all models but good to know.

[-]

WishfulAgenda@reddit

I struggled with q6 in both 27b and 35b. Spun up 27b fp8 in vllm on a cloud rtx 6000 and it’s awesome. Now try q6 27b again for tasks with simple instructions and it’s doing ok so far again.

[-]

vevi33@reddit

For me there are cases what Q6 35 MoE can solve but 27B Q4 can't. And sometimes it's the reverse case. 27B understands everything better but since 35B is much faster it's hard to decide. I can do so much more with the 35B even if I prefer the precision of the 27B

[-]

RedParaglider@reddit

With 35b I can run a planning, coding, and dialectical loop in the time it takes to do the planning on 27b.

[-]

LeucisticBear@reddit

The discover AI YouTube channel has a pretty interesting logic problem they hand each model. The moe version did very well on that where the dense model couldn't finish, so like most things in AI world it looks like different models have different strengths.

[-]

bighead96@reddit

My home setup is the same as yours and I’m also using the 35B it produced better results and is much faster. It runs at 80 tokens per second and the 27B has worse results at somewhere around 13-15TPS. Who would anyways use the 27B is what I want to know

[-]

hidden2u@reddit

Am I crazy I thought nvfp4 was only for nvidia gpus

[-]

Reasonable_Friend_77@reddit

My personal experience with both is that 27B is really good at coding and follow instructions. 35B is better at general agentic tasks and more creative in some of the things it says (not for coding). This is based purely on my personal impression and use cases, so not a benchmark by any measure. But 35B is my go to since for coding I generally use Claude.

[-]

My_Unbiased_Opinion@reddit

I tried 35B on my heremes agent. it eventually deleted all my work on accident. I immediately switched. Unsloths UD IQ3XXS with 262K context at kvcache Q8 destroys 35B IQ4XS on a 24GB card. 35B is fast, but if it cant do what its asked, then its a waste of time imho. 35B is really good for research tasks though.

[-]

NNN_Throwaway2@reddit

27b crushes the 35b and it isn't close.

However, LLMs are stochastic by nature so sometimes the results aren't what would be expected.

[-]

SM8085@reddit

I'm enjoying Qwen3.6-35B-A3B-Q8_0 with Hermes agent so far. I decided to see what Nous Research have been working on.

Hooking Mealie up with a mealie-mcp gives it access to my 'recipes' or meals.

I was asking Hermes to make meal plans that matched my calorie quota and so Hermes/Qwen spontaneously decided to write a mealie skill and include the python to calculate which meals fit with each other to match that calorie goal.

The A3B speed definitely helps on my slower hardware. I could run 27B, but I'd be waiting a lot longer. Especially because it's also processing batches of images in parallel, which has like 10k more images to process for that job alone. It's going to be working on that for a while.

[-]

gregpeden@reddit

I think an example of where this can be true is regards to quantization strategy on Limited hardware. If you have 10GB VRAM then you can easily run 35b with 6-bit quantization, but with 27b you're probably looking at 3-bit.

[-]

Main_Secretary_8827@reddit

Gpu? Context?