Qwen3.6 35b-a3b
Posted by EffectiveMedium2683@reddit | LocalLLaMA | View on Reddit | 116 comments
Originally I was a diehard fan of Gemma4 26b-a4b because it really is a remarkably intelligent LLM. I ran Qwen3.6 via Ollama and found it impressive, but still favored Gemma. Ollama did it a disservice, at least on my PC.
Ran it straight through llama.cpp and it is much faster than Gemma4 26b-a4b, roughly equivalent in general intelligence, better in strict prompt adherence, and it doesn't slow down on long context. Like, I'm back to being a Qwen fan.
Just thought I'd share haha
our_sole@reddit
I am just stunned at how well Qwen3.6-35B-A3B MoE is working for me. I have an RTX 3090 (24GB VRAM) and 64GB RAM on a Beelink GTi14 (Ultra 9 185H CPU) with the Beelink eGPU dock.
I switched from LM Studio to llama.cpp (not because LMS had any issues, I had just heard that llama.cpp was faster and very tunable).
I spent some time tuning llama.cpp with the LLM, got the pi.dev harness running, and started getting great results.
Up until now, local AI was just kind of a playtoy and I used Claude for heavy lifting and Copilot VS Code for medium/light stuff.
I'm getting close to 100 tk/s. I have been trying increasingly difficult tests/prompts and it's handling them fine. It feels close to Haiku or maybe Sonnet (but not Opus, obviously). I vibe coded a Flask/JavaScript/Tailwind CSS app with local browser storage and it nailed it. Based on my PRD, it even found and added sample data so I could test things.
If I can use it for 60 or maybe/hopefully 70% of my daily AI coding and start to untether myself from the Anthropic usage circus, I'll be quite happy. Unlimited tokens are awesome.
There are GitHub PRs for a cache invalidation bug and for full MTP support in llama.cpp, which I hope will get merged soon. These should make the setup even better.
Local AI is becoming very powerful. Exciting times!
cheers
sirmeow-meow@reddit
What did you do in llama.cpp to get it to 100 t/s?
siegevjorn@reddit
You can use the -ot flag to keep the expert layers on the CPU. It needs a bit of engineering but works like a charm.
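A minimal sketch of what I mean (the model path and the exact tensor regex are just placeholders; check your GGUF's tensor names):
```
# Offload all layers to GPU, then override just the MoE expert tensors back to CPU/RAM
llama-server \
  -m ./Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 99 \
  -ot "\.ffn_(up|down|gate)_exps\.=CPU" \
  -c 32768
```
Then watch the reported tok/s (or use llama-bench) and adjust from there.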
tylerhardin@reddit
Not an expert, but I think -fitt superseded the need to tune -ot manually a couple months ago
siegevjorn@reddit
Ha, thanks, will look into this.
tylerhardin@reddit
The idea behind the -fit/-fitt args is to automatically offload as much as possible, presumably attempting to offload the fastest layers first, like expert layers. You can sometimes do better, but it tries to do what you do with -ot for you. Before -fit was added, it had to be done manually with -ncmoe or -ot.
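For reference, the older manual route looked something like this (the layer count is just an example you'd tune until you stop OOMing):
```
# --n-cpu-moe N keeps the MoE expert weights of the first N layers on the CPU,
# a coarser version of the -ot regex approach
llama-server -m model.gguf -ngl 99 --n-cpu-moe 20
```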
siegevjorn@reddit
Yes. But -fit doesn't supersede or replace -ncmoe or -ot. That's my point. What -fit does and what -ncmoe/-ot do are functionally different.
tylerhardin@reddit
https://grok.com/share/bGVnYWN5LWNvcHk_7989596b-f702-42e7-af82-2355fc1fba55
That's exactly what it does do. That's the whole point. It's solving the optimization problem algorithmically.
siegevjorn@reddit
Sigh. Referencing AI without verifying? Can't be serious, people.
tylerhardin@reddit
I use it. I'm not doing research into something I'm actively using successfully. Better things to do. You could use it too and save yourself some time. Or not, makes no difference to me.
siegevjorn@reddit
Actually, it makes a lot of difference. Because I'm speaking from experience. If you don't have time to verify yourself, why argue?
tylerhardin@reddit
Decided to find it for you. You're welcome.
See common/fit.cpp:
siegevjorn@reddit
Ok, thanks. I see your point now.
You're right about that:
-fit does do some optimization for MoEs. For instance, -fit does keep all the active weights on the GPU. But what it doesn't do also matters for speed: for instance, all the attention weights and the KV cache. If those are left on the CPU, that's a performance hit, and -ot lets you load all of those weights onto the GPU, so it's a better optimization for speed.
I assume you've got a dual symmetric GPU system. In a dual-GPU system, the performance difference between -fit and -ot may be minor. But when you have a single fast GPU, the advantage of -ot becomes much more significant.
tylerhardin@reddit
I'm running exactly the type of system you are, actually -- GLM 5.1 on a single RTX 6000 and an Epyc. It works as well for me as the tool I wrote before (better actually). The reason it tags different layer types in the code is that there's a priority hierarchy. It puts the most important layers (for speed) on gpu, then moves down the hierarchy, packing as many as possible. It tries to do what you've been doing automatically. The biggest issue I've had with it is that sometimes its estimate is wrong and you get a CUDA OOM error, in which case you have to set -fitt higher. It's actually really hard to predict the exact total CUDA allocation needed for inference (my tool was very tedious to debug, hence why I was happy to abandon it).
Have you tried it since we've been chatting? I bet it'll be basically the same.
siegevjorn@reddit
Oh, I see. Yes, I understand. Using -ot could be really painful with OOMs.
Yes, I have tried it. I have an RTX 4090 and a 5060 Ti. I saw a notable speed difference when running MoEs with -fit vs -ot, like a 20-30% speed difference favoring -ot.
tylerhardin@reddit
I'm not arguing. I'm telling. I wrote a tool myself to deduce optimal -ot args before replacing it with -fit after llama added it. We're not on the same level. You use tools. I make them. I don't debate people below me. You can take the info from your better or not. I'm not wasting more time on you.
Southern-Expert22@reddit
CPU? I load the active weights via GPU and pin the rest in RAM. Is that what you mean, or something else?
siegevjorn@reddit
Yes expert layers are inactive. By definition.
huzbum@reddit
IQ4_NL fits comfortably on my 3090 with 256k context with Q8 KV cache. 110tps.
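For reference, my launch flags look roughly like this (model path is a placeholder; depending on your llama.cpp build the flash-attention flag is -fa or --flash-attn on):
```
# ~256k context on a 24GB card by quantizing the KV cache to Q8
llama-server \
  -m ./Qwen3.6-35B-A3B-IQ4_NL.gguf \
  -ngl 99 \
  -c 262144 \
  -fa \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```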
AdIllustrious436@reddit
I hit 90 to 110 tok/sec on a single 3090, 130k context. I could probably go up to 262k with aggressive KV quantization. Gemini Flash 3 intelligence and throughput level. Mind-blowing.
Dependent-Guitar-473@reddit
What exactly did you change when you say "fine-tune it"?
I am a beginner, I used 9B model (20T/s) and 35B (64t/s)
I would love to get more out of them
Public_Umpire_1099@reddit
I have basically your setup minus the GPU for my homelab. Even on that hardware, I still get bare minimum 30 tokens a second on properly quantized MTP models at Q4. Legitimately usable.
FormalAd7367@reddit
Just curious - why do you prefer A3B over 27B?
AdIllustrious436@reddit
3x faster throughput and only a marginal intelligence difference in my testing. The catch is fitting the full context on one 3090.
DeSibyl@reddit
Curious if you think it would be good as a general assistant? Right now I've been using Gemma 4 31B as my daily general assistant, and I only get about 30 t/s. I tried using Qwen 3.6 27B since I can get higher context and also use MTP to get 70 t/s, but it sometimes would get stuck in a thinking loop. Often enough that I switched back to G4... I mainly use it for work: proofreading emails, asking it to create drafts based on pictures of info, and such. Maybe some coding stuff.
jopereira@reddit
It's interesting... I have yet to encounter a single loop with 27B IQ3 XXS (using turbo3 !!!). But I'm ALWAYS in no-thinking mode. It solves every single problem I throw at it!
AdIllustrious436@reddit
I've also had the thinking loop on 35B. It usually recovers, but it's not ideal. Never tried G4 (I mainly do agentic dev), but from what I've read, most people rank G4 26B-A4B as the best all-rounder in this weight class. The 31B just feels too heavy for my setup to run comfortably.
DeSibyl@reddit
Yea, fair enough. I'm just worried a 3B-active or 4B-active MoE isn't going to be smart enough to pull data from pictures correctly. The screenshots I send contain a decent amount of numbers, so I'd like it to be accurate and reliable. (I always double-check, but still.)
lumos675@reddit
It's pretty smart even for coding. I set it up as a service so when my computer turns on, the model loads as my personal assistant for everything. I get 210 tokens/s on a 5090.
AdIllustrious436@reddit
PS :
This alone makes me question the claim tho...
bnightstars@reddit
My 35B is actually working great with Claude Code as the harness, but you need hardware that can handle all the prefill tokens Claude Code loves to spend. And the llama.cpp cache invalidation issues are not helping with that.
our_sole@reddit
You have claude code running against local qwen3.6-35b-A3B running under llama.cpp?
Could you share your claude shell script or bat file that does this (the env vars, --model, config, etc..)?
I tried for quite some time to do this and claude just flatly refused to use the model. It saw the model, but wouldn't use it: "There's an issue with the selected model..it might not exist or..."
bnightstars@reddit
I run on a MacBook, so MLX, but overall it's just a set of ENV vars that point Claude Code to a local LLM; the DISABLE_NONESSENTIAL_TRAFFIC one is key.
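Roughly, it looks like this (the URL, token, and model name are placeholders for whatever your local server exposes; double-check the exact variable names against the Claude Code docs):
```
# Point Claude Code at a local OpenAI/Anthropic-compatible endpoint
export ANTHROPIC_BASE_URL="http://localhost:8080"
export ANTHROPIC_AUTH_TOKEN="local-key"             # any non-empty string for a local server
export ANTHROPIC_MODEL="qwen3.6-35b-a3b"            # must match what your server reports
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1   # the "nonessential traffic" switch mentioned above
claude
```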
Opposite-Station-337@reddit
Just the first section variables got me going. Been using pi agent though.
https://www.reddit.com/r/LocalLLaMA/s/u0Fuj1kdBC
our_sole@reddit
Thanks much! I'll test this again today.
Cheers
superdariom@reddit
Which quant are you using?
our_sole@reddit
The unsloth dynamic UD-Q4_K_XL
Altruistic-Dust-2565@reddit
What about 262K context speed? That's like the minimum for usable coding now.
Cyber_Ghost@reddit
I find Gemma the most useful model for most knowledge-related tasks, and it helps me pretty well with translation and grammar (learning Italian).
I wanted to do some evaluations on the models using custom tests, so I let Claude Code build a test suite for doing it. I wanted to compare Gemma4 26B-A4B at FP8, Gemma3 31B at Q5 but now I'll also add Qwen3.6 35B-A3B as well. Sounds like an interesting idea to test.
It's running against 26B now on 2xB70 cards at max context.
nickless07@reddit
Can you add the qwen 122B too?
Cyber_Ghost@reddit
122B-A10B finished; at Q3 it wasn't very good in the tests. You can check the GitHub for Claude's conclusions.
nickless07@reddit
Oof, damn. I thought it was only bad at coding, but would perform better in such tasks due to its world knowledge.
Cyber_Ghost@reddit
I expect it to be better at a reasonable quant.
nickless07@reddit
Idk man. I can only run Qwen3.5-122B-A10B-UD-IQ2_XXS at ~4 tokens/s, and for the few runs I used it, the writing style was much better than Qwen3.6 at Q8. I know MoEs suffer more than dense models from quants, but for me it was pretty decent even at that low quant. Then I read everywhere that it is bad with code (not my use case at all), and it is hard to find any tests that don't aim at coding. I really hope we get that as a Qwen3.6 too, but it looks like we are out of luck there.
Cyber_Ghost@reddit
Did you try Gemma for writing?
nickless07@reddit
Yeah, and whenever that one got stuck, I switched to the 122B to fix the problem. For example, I was working on a prompt with Gemma 4 for hours, a lot of back and forth, then it was running in circles; I switched to Qwen3.5 and got it done in a single shot. It's just that my English isn't the best, and where Gemma beats around the bush, the 122B Qwen jumps in straight with the right phrases.
Cyber_Ghost@reddit
Just an FYI - Managed to get 122B-A10B to load at Q3, going to run the tests now.
Cyber_Ghost@reddit
I don't think I have the VRAM to run it in a good config, but I'll try to run Q3 and see what I get from it.
Cyber_Ghost@reddit
So, Claude took a while to finish all the tests on the 2 Gemma models and one Qwen.
I've had it dump everything into github.
I'll write a short post about it later but in general Gemma 31B seems to be the winner in the types of tests Claude made with Qwen3.6 35B-A3B falling behind. I'll try to get 122B-A10B running on my setup to test too but not sure how it'll perform on my system.
https://github.com/pelegw/llm-eval/blob/main/ANALYSIS.md
No-Juggernaut-9832@reddit
Friends don't let friends use Ollama. llama.cpp, or MLX if you are on Apple.
Southern-Expert22@reddit
Use YaRN and Google turboquant to get a 1 million token context window, and run with --no-mmap. I'm telling you, this model is better than Opus, and you get the 1 million token window without it losing track of big projects.
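If you want to try the YaRN part in llama.cpp, the rope-scaling flags look roughly like this (the scale factor and original-context values are only illustrative; match them to your model's native context):
```
# Stretch a model's native context with YaRN rope scaling (values are examples)
llama-server -m model.gguf -ngl 99 \
  -c 1000000 \
  --rope-scaling yarn \
  --rope-scale 4 \
  --yarn-orig-ctx 262144 \
  --no-mmap
```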
cell-on-a-plane@reddit
Thanks!
I've been having good luck with vLLM on my 5090.
FIdelity88@reddit
Why not llama.cpp? It supports MTP now
cell-on-a-plane@reddit
Why not vllm?
FIdelity88@reddit
Better performance
Karyo_Ten@reddit
Try running 10 concurrent agents on llama.cpp and watch perf flounder, while vLLM and SGLang can do an aggregate 1000+ tokens/second.
FIdelity88@reddit
On what hardware would you run 10 concurrent agents with 1000+ tk/s?
Karyo_Ten@reddit
2x RTX 5090 can. I can reach 1500 tok/s on 2x RTX Pro 6000.
scooter_de@reddit
Not in the official branch though
BraceletGrolf@reddit
I assume you run it quantized, or can you share your setup ?
cell-on-a-plane@reddit
```
vllm serve Lorbus/Qwen3.6-27B-int4-AutoRound \
  --dtype half \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8 \
  --max-num-seqs 2 \
  --limit-mm-per-prompt '{"image": 0, "video": 0}' \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --port 8080 --host 0.0.0.0 \
  --trust-remote-code \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --compilation-config.cudagraph_mode none \
  --enable-prefix-caching
```
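Quick sanity check once it's up (the port matches the serve command above; the model name is whatever /v1/models reports):
```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Lorbus/Qwen3.6-27B-int4-AutoRound", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 64}'
```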
DaMoot@reddit
Congrats on getting it working for your needs! For me it was a dumb, looping mess even with the correct model card settings. Wasn't even all that fast in llama.cpp.
Help out the rest of us; what are you actually using 35b-a3b for? Where is it succeeding, where is it failing?
For me Qwen3.5 27B is the current best to fit in 32gb for email scanning, classification and alerting, SIEM log processing, ticket generation/changes in ticketing system, light vibe coding (I do all my vibe on Claude), SQL access, web searches... 3.6 seems to have introduced some looping.
cognitium@reddit
Did you try 35b with thinking turned off? It wasn't usable for me until I turned it off. And it's 8x faster than 27b on my card.
DaMoot@reddit
You know what, I have not! I will give that a try this afternoon for my normal workflow and see how things go.
Thanks!
Civil-Reporter7812@reddit
Why are you still using Ollama? And I mean, seriously, why? It's rotten software with worse performance than llama.cpp
https://sleepingrobots.com/dreams/stop-using-ollama/
bighead96@reddit
There's a common belief that Gemma4 is very smart. It's not; it's actually very dumb. It's very good at confidently telling you it's fixed things, and here are the issues and how it resolved them. If you create a bunch of bugs and ask it to fix them, it will confidently tell you it fixed them, and almost none will work properly. It's like that friend you have who is dumb as a box of rocks but tells you they are an expert at everything. And they will look you dead in the eyes and be like, trust me bro. No, I'm not trusting you, because you break everything you touch as far as coding goes.
Maleficent-Ad5999@reddit
I find Qwen 3.6 35B often deviating from my instructions despite not having too much context. For instance, I tell it "ok, just focus on addressing #1 from your suggestion" and it proceeds to fix all of them.
cleversmoke@reddit
Same, that's why I moved to 27B: 27 tok/s vs 80+ tok/s on the 35B-A3B, but 27B is spot-on at following instructions and gives great output. 35B-A3B was a lot of rolling the dice for a good seed, in my early experience with it.
Imaginary-Unit-3267@reddit
Yes, you have to explicitly tell it what not to do, not only telling it what to do. But if you give it explicit "no" instructions, it follows them as long as it doesn't have such a bloated context that it's getting generally degraded cognition.
bighead96@reddit
Oh wow, interesting! I haven't been using it much yet, but I'll have to see if that's the case. I started finding people saying to just use the 27B, that it's much better.
rpkarma@reddit
I was shocked at how consistently bad Gemma 4 31B is at code analysis on my eval suite. Makes up the same fake bugs over and over lol
dim722@reddit
Like many others, I was initially impressed with Gemma on my mediocre setup, but then I realized one thing: Gemma is not smart - it just gives the impression of being smart through fast output and oversized responses. It's all about talking, nothing else.
The model has terrible tool discipline and is basically incapable of applying edits, no matter what harness you try to use - and I've tried them all, including tricked versions of Claude Code and Codex. It seems that all Gemma-4-class models inherited the same tool-related issues, since dense Gemma exhibits the same behavior.
thisguynextdoor@reddit
I'm having continuous reasoning loops with Qwen. It's almost unusable. You could also try Gemma 4 26b with the multi-token prediction assistant. It will speed up Gemma 2-3x.
cmndr_spanky@reddit
You almost certainly have param settings / context window settings wrong with Qwen. It does think for a while but not like that.
thisguynextdoor@reddit
I'm running it on LM Studio with default settings. Context window of 128k. Just today I stopped it from looping after 50k tokens of endless reasoning:
Final answer:...
Am I consice?
Wait, user might mean...
I should reconsider
OK, final answer
No, wait..
cmndr_spanky@reddit
The fact that you couldn't answer my question tells me everything I need to know :)
thisguynextdoor@reddit
You haven't asked anything.
cmndr_spanky@reddit
What parameters are you using?
yeah-ok@reddit
I've been strict about --no-reasoning lately and having plenty of success with one-shot programming extensions for pi agent, etc. I think we have to remember that top-k is, in a sense, a selection out of what is already a latent thought process in the model.
onewheeldoin200@reddit
You for sure have to use the recommended settings from the Qwen team. Unsloth etc. all have them on their HF pages too.
FranticBronchitis@reddit
It sure helps but it's still noticeably loopier than gemma4
DR4G0NH3ART@reddit
Have you put a repeat penalty and a reasonable temperature?
No_Swimming6548@reddit
What default settings? Qwen has its own ideal parameters for both coding and general purpose use.
the_fabled_bard@reddit
Their coding settings (which I use) actually encourage thinking loops. It happens often. I just stop it and say "you were spiraling" and it almost always picks up where it was, recognizes it needed to shift perspective, and gets the task done.
But it does happen often. Letting it run for a long time unattended is a recipe for failure.
Snoo_81913@reddit
Set the reasoning budget to 4096.
EffectiveMedium2683@reddit (OP)
That's interesting. I haven't seen that. I did see it with Qwen3.5. In the past, when that was a major issue with most models, I just used a small fine-tuned model to watch it and, if it seemed to be looping, inject a message to stop thinking or even just force the tag. Have you tried adding something like "If at any point you are uncertain, just ask me. I won't bite."? I know that sounds ridiculous, but these CoT models have anxiety or something haha
Snoo_81913@reddit
What this guy said. Also make sure reasoning is set to 4096 to prevent overthinking; adjust to your needs, of course. Then, if you're using turboquant_plus with Qwen models, there are some K/V settings that will cause ????? or ////// output. Stick with turbo3 for V and Q8 or Q4 for K, and that should stop it.
sid351@reddit
Do you know more about these k/v settings?
I'm running non-turbo quants and getting frequent (6 times today) "terminal thinking loops" where the token generation gets stuck just repeating "/" endlessly until the maximum length is hit for the prompt.
I'm running llama.cpp on Windows, and I have a post where I've detailed my setup and things I've tried so far.
Snoo_81913@reddit
Sure, put the link in the post and I'll take a look.
sid351@reddit
Amazing, thanks:
https://www.reddit.com/r/LocalLLaMA/s/oCHseapcdr
Snoo_81913@reddit
Alright man, I took a look. I'm pretty sure I know what's going on and I've got a pretty decent way to test it that won't take too long. Give me like an hour or two and I'll do a pretty big post over there with everything.
sid351@reddit
Omg, you're a saint.
Snoo_81913@reddit
It's up.
anykeyh@reddit
You need to take the time to tweak the params to your taste. Honestly, I was like you early on, but after a bit of trial and error I am now very happy; no more endless thinking.
Here are my parameters:
--temp 0.7 --top-k 20 --top-p 0.95 --min-p 0.01 --repeat-penalty 1.03 --frequency-penalty 0.05 --presence-penalty 1.5
I also keep the reasoning budget at 4096 tokens.
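Putting it together, a minimal llama-server launch with those samplers looks roughly like this (the model path and context size are placeholders; I leave the reasoning budget to the harness, so it isn't shown here):
```
llama-server \
  -m ./Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 99 -c 65536 \
  --temp 0.7 --top-k 20 --top-p 0.95 --min-p 0.01 \
  --repeat-penalty 1.03 --frequency-penalty 0.05 --presence-penalty 1.5
```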
SnooPaintings8639@reddit
What inference engine are you using? vLLM, llama.cpp, or something else?
DaMoot@reddit
Which Qwen?
Do you have the model launching with the recommended settings? There's a big difference between running it with no settings (just launch defaults) and with the correct settings.
vanbukin@reddit
Try latest vN chat template https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates/tree/main/qwen3.6
thisguynextdoor@reddit
Thanks! I'll try it right away.
abnormal_human@reddit
I'm seeing the same. It just thinks for 100k+ tokens, repeatedly backtracking.
siegevjorn@reddit
After the llama.cpp MTP PR, Qwen3.6 speed is truly insane.
JsThiago5@reddit
How much faster is it in comparison to Gemma4 26b?
EffectiveMedium2683@reddit (OP)
On my OptiPlex 3000 with an Intel 12th-gen i5 (Alder Lake) and zero GPU, I'm getting 12 tokens per second on long context. Like, it doesn't slow down. Gemma 4 26b-a4b, once it gets past like 10k context, I start seeing it slow down from ~11 tokens per second all the way down to like 8 tokens per second.
Living-Office4477@reddit
ddr4 ram?
EffectiveMedium2683@reddit (OP)
Yeah. It had a 16GB stick and I added an 8GB stick I found. Shockingly decent computer for LLMs. 75-watt power supply. I can run it literally for days of non-stop inference creating custom datasets and it doesn't even get hot.
boutell@reddit
This setup was not on my bingo card!
Living-Office4477@reddit
Super cool, you kind of gave me hope to try too, any other models you have tried and liked on that hardware?
virtualPNWadvanced@reddit
Are you able to tell the difference between 11 and 8 tk/s? Unless I'm paying SUPER close attention, I don't know if I've dropped speed.
EffectiveMedium2683@reddit (OP)
Honestly, not really. I just know from the output where it shows prefill and decode speed.
Barry_22@reddit
Wait what
Isn't Qwen 3.6 35B MoE MUCH more intelligent than Gemma 26B MoE? Coding-wise, brevity-wise, just in general?
Organic_Scarcity_495@reddit
The Ollama vs llama.cpp performance gap on MoE models is real. Ollama's default settings don't handle the expert routing well on desktop-class hardware. Running through llama.cpp directly with a tuned batch size makes a big difference.
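If you want to experiment with that, the batch knobs in llama.cpp are -b and -ub (these values are just a starting point, not a recommendation):
```
# Larger logical/physical batch sizes mainly speed up prompt processing
llama-server -m model.gguf -ngl 99 -b 2048 -ub 512
```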
DiscipleofDeceit666@reddit
What do you use it for?
koygocuren@reddit
Qwen's answers are not the issue. The issue is its enormous thinking blocks.
yami_no_ko@reddit
Not polluting your system with bloat like Ollama is a valuable lesson learned.
I can't tell for sure whether I prefer Qwen3.6. It's my go-to for programming, but Gemma-4 performs better with language and knowledge tasks in the context of Western culture.
mr_Owner@reddit
Try the reasoning effort flag; a reasoning end message with .cw at the end of the reasoning message works for me so far.
MundanePercentage674@reddit
Same for me. My use case is mostly agentic tasks, and Gemma failed me every time. I switched to Qwen 3.6 and it gets the job done.
nickm_27@reddit
In my experience it is considerably worse at prompt adherence, at least at Q4_K_XL. Gemma4 is much better in that regard, at least for my use case as a voice assistant.