https://huggingface.co/unsloth/Qwen3.5-27B-GGUF

[-]

BubrivKo@reddit

My GPU is 16 GB VRAM and I use Qwen 3.5 35B Q4. You are not forced to load the whole model into the GPU. You can just offload some layers. For example: with my 9070 XT and its 16 GB VRAM I got 20-25 tks on that qwen model.

[-]

AXYZE8@reddit

I know about this, but I'm forced to load all into GPU - my Ryzen causes BSODs if I set RAM above 2667Mhz. I spent hours tweaking voltages, timings and even 2800MHz will cause WHEA errors. Sad reality of having 4 DIMMs on AM4. :/

[-]

VampiroMedicado@reddit

Huh did you update the BIOS? That sounds like something that would happend in early Ryzen era.

[-]

buttplugs4life4me@reddit

Intel's AutoRound Q2s are actually super good, really surprised. Made me able to run Qwen3 35B at acceptable speeds. Hope they'll release some for Gemma 4, though I think I can run Q4 there

[-]

-dysangel-@reddit

oh snap

I saw people saying

Do not blindly believe everything people say. Ask for proof. Now have a look at this and see for yourself how far apart they are.

[-]

AXYZE8@reddit

After testing I would say that sadly this model is unusable at IQ2. It mixes up a lot of facts with simple questions and sometimes doesn't even understand question properly.

[-]

Altruistic_Heat_9531@reddit

And after a week maybe : "Gemma 4 26B Heretic Uncensored Ablated Claude Opus 4.6 Reasoning Distlled Expanded fine tuned quantized"

[-]

sibilischtic@reddit

Eh im going to wait for

Gemma 4 26B Heretic Uncensored Ablated Claude Opus 4.6 Chain of Thot (NSFW) Quasimodal chuck Norris bingo night

[-]

ChaotixEvil@reddit

ironically i've been trying out the qwen 3.6 preview, and it felt like a downgrade from 3.5.

[-]

putrasherni@reddit

incoming comparison content with qwen3.5

[-]

Singularity-42@reddit

Comparison of Gemma 4 vs. Qwen 3.5 benchmarks, consolidated from their respective Hugging Face model cards: | Model | MMLUP | GPQA | LCB | ELO | TAU2 | MMMLU | HLE-n | HLE-t | |--------------| ----- | ----- | ----- | ---- | ----- | ----- | ----- | ----- | | G4 31B | 85.2% | 84.3% | 80.0% | 2150 | 76.9% | 88.4% | 19.5% | 26.5% | | G4 26B A4B | 82.6% | 82.3% | 77.1% | 1718 | 68.2% | 86.3% | 8.7% | 17.2% | | G4 E4B | 69.4% | 58.6% | 52.0% | 940 | 42.2% | 76.6% | - | - | | G4 E2B | 60.0% | 43.4% | 44.0% | 633 | 24.5% | 67.4% | - | - | | G3 27B no-T | 67.6% | 42.4% | 29.1% | 110 | 16.2% | 70.7% | - | - | | GPT-5-mini | 83.7% | 82.8% | 80.5% | 2160 | 69.8% | 86.2% | 19.4% | 35.8% | | GPT-OSS-120B | 80.8% | 80.1% | 82.7% | 2157 | -- | 78.2% | 14.9% | 19.0% | | Q3-235B A22B | 84.4% | 81.1% | 75.1% | 2146 | 58.5% | 83.4% | 18.2% | -- | | Q3.5-122 A10 | 86.7% | 86.6% | 78.9% | 2100 | 79.5% | 86.7% | 25.3% | 47.5% | | Q3.5 27B | 86.1% | 85.5% | 80.7% | 1899 | 79.0% | 85.9% | 24.3% | 48.5% | | Q3.5 35B A3B | 85.3% | 84.2% | 74.6% | 2028 | 81.2% | 85.2% | 22.4% | 47.4% |

MMLUP: MMLU-Pro
GPQA: GPQA Diamond
LCB: LiveCodeBench v6
ELO: Codeforces ELO
TAU2: TAU2-Bench
MMMLU: MMMLU
HLE-n: Humanity's Last Exam (no tools / CoT)
HLE-t: Humanity's Last Exam (with search / tool)
no-T: no think

[-]

Far-Low-4705@reddit

uuuh, this is unexpected... looks like qwen 3.5 beating gemma 4??

even if only tying, both models are more compute efficient from qwen. 3b VS 4b active params, and 27b VS 31b dense. qwen models are pulling ahead across the board tho

[-]

Monkey_1505@reddit

For the MoE the smaller the total params, the more likely you can fit all or most of it on your vram. And that'll boost performance more than 1b params active will.

consolidated from their respective Hugging Face model cards

The wording makes it sound like you did this. Just add the source.

[-]

Singularity-42@reddit

I did

[-]

uhuge@reddit

just hyperlink, it's this thing called wold-wide web.

[-]

valuat@reddit

People can be anal for no reason. I mean, there's a reason for their psychiatrists to disclose.

[-]

Imaginary-Unit-3267@reddit

Some basic calculations show that in terms of geometric average of all these scores (implying overall competence - geometric average is very sensitive to the minimum value) for the six models that have values for every single benchmark, Qwen3.5-122B A10B is the overall strongest contender, with 27B in second place - oddly, in terms of geometric average divided by effective parameter count (square root of product of full size and active experts size), 35B which I see a lot of people complain about on here appears to be by far the "densest" in score per parameter, and I wonder if that actually means anything useful or not.

Nobody asked, but I just like playing with tables of numbers uwu

[-]

ShengrenR@reddit

hrm - the HLE-t in particular are unfortunate, seems maybe they needed more agentic traces in there...

[-]

kaggleqrdl@reddit

yeah hle-t is a pretty important bench

[-]

Hans-Wermhatt@reddit

Seems like Gemma 4 31B is slightly worse than Qwen 3.5 27B in most benchmarks outside of multi-lingual and MMMU pro.

[-]

vivaasvance@reddit

The multilingual advantage is underrated for

enterprise use cases.

Most benchmark comparisons focus on English

reasoning tasks. But for global deployments

where you need consistent performance across

languages — that gap matters more than a few

points on MMMU.

Gemma 4's multilingual strength could be the

Prestigious-Use5483@reddit

I am a human, I need visualization to understand.

[-]

Cubow@reddit

E2B performing better on almost all benchmarks than Gemma 3 27B is insane, there is no way.

Also, no 1B, my life is ruined

[-]

putrasherni@reddit

i think that these models will be baked into apple devices
all of them are small parameter and fit within 80-90GB tops

could be that gemma small models run inside of iphone

If not, you could try the 26B at a more reasonable Q4/6 and have just a little spillover into system RAM, tho slow down is to be expected.

I run Qwen 3.5 Next Coder with 16GB of VRAM and still get 20+ toks/s. Surely this wouldn't be any slower than that?

[-]

Ink_code@reddit

the 2B and 4B can run on it since i can ran models of that size on an intel iris xe integrated GPU with 16 GB ram, as for the bigger ones i am not sure since i don't have ram for them, but since 26B model is a mixture of experts if you have enough system ram you can offload the rest of the weights to it while keeping the active weights on the GPU so i think you probably can run that one.

[-]

FullOf_Bad_Ideas@reddit

they're comparing a reasoning model to non-reasoning. There are benchmarks where reasoning models have an advantage.

Gemma 3 27B gave you instant answer though.

You could have argued that Qwen 3 4B Reasoning 2507 was better than GPT 4.5 or GPT 5 Chat this way. It's a half-truth.

[-]

Prestigious-Crow-845@reddit

But Qwen 3 4B Reasoning 2507 was never better than GPT 4.5 or GPT 5 Chat even with reasoning, arne't it?

[-]

Jan49_@reddit

It ranked higher in some benchmarks, like artificial analysis. Most people don't understand that intelligence and knowledge isn't the same. A small model like Qwen 3 4B 2507 will never have the same amount of knowledge as a big model. What these benchmarks show is that smaller models are getting smarter, they are getting better at solving problems, retrieving information via tool calls (web search) and then handle that data to give a good answer.

Gemma 3 was Historically Fun to finetune.

The outputs from that model certainly punched every ticket to hell I could possibly take, and inflicted further permanent psychic damage on me. I freaking loved it.

[-]

Both_Opportunity5327@reddit

[-]

ProfessionalSpend589@reddit

I’ve had OK results with llama.cpp + Vulkan and Radeon pro Ai R9700. Ran Qwen 3.5 122b at Q8_0. :) I’m OK with the noise too.

But I had to remove my second NVMe on one of my Strix halos. Turns out that the eGPU was causing the whole system to freeze while on the other strix halo with single NVMe it worked like a charm.

So hell yeah, give us the med model!

[-]

PaceZealousideal6091@reddit

Wow! Never thought about that! So, medgemma 27B is popular in Silly Tavern circles?

[-]

OcelotMadness@reddit

I see the answer Loafy gave you but I'm just gonna say I actually play Sillytavern and keep semi up to date on the models people use and I have literally never seen MedGemma. I think they're bullshitting you. The closest thing I've seen is Gemma 3 27b and its finetunes.

[-]

PaceZealousideal6091@reddit

Checks out. I am not really an active sillytavern user, but i never heard anyone talk about them as well. Thankfully people bullshitting wasted their own time and effort talking about it. It was just a cool info for me and now you grounded the fact. Thanks.

[-]

LoafyLemon@reddit

Yep! Either as the base too fine-tune on top, or as a merger to enhance anatomical descriptions.

[-]

Borkato@reddit

Holy crap! So now it’s like officially “here, go nuts?”

[-]

Inevitable_Tea_5841@reddit

Yep

[-]

csm101_bob@reddit

Big deal honestly. Apache 2.0 means you can do anything with these models commercially without Google's terms hanging over you. This is Google finally playing the open-weights game for real — not just "open with asterisks." Could shift a lot of enterprise adoption that was stuck on "but what's the license?" questions.

[-]

BeneficialVillage148@reddit

This is a big release 🔥

Open weights + 256K context + multimodal + better coding/agent support… that’s actually crazy progress in one update.

Aggressive-Permit317@reddit

Gemma 4 dropping feels like Google finally stopped playing it too safe. The efficiency numbers they’re claiming could actually make local models feel snappy again on mid-range hardware instead of just server-grade stuff. I’ve been running the last couple of Gemma versions locally and the jump in coherence is noticeable. Anyone already spinning this one up and seeing the difference in real tasks, or is it still too fresh?

[-]

Ok_Edge1810@reddit

Just shipped a small Android assistant app using Gemma 4 E2B via the LiteRT-LM tool calling works surprisingly well out of the box. The native format (<|tool_call>) is clean to parse, and the model stays on-task without much prompting.

Coming from Gemma 2, the jump is significant. Response quality is noticeably better, and the memory footprint is actually smaller for what you get. 52 decode tokens/sec on GPU makes streaming feel instant.

Next experiment is using it as a coding assistant, curious how E4B holds up on LiveCodeBench-style tasks locally. Will report back.

[-]

One-Art-5119@reddit

I wish they would make an Android app for it

[-]

Sambojin1@reddit

PocketPal has been updated. Works fine on Gemma 4 4B 4_0 quant.

I'm going to see if I can get a bigger one going.

ZellahYT@reddit

Gemma 3 tool calling was abysmal

[-]

Daniel_H212@reddit

Yup even the one time I got it to search the web repeatedly (gave it a task where a single search definitely gets nowhere close to the full answer), it did like 5 searches and a page fetch, talked about needing to do more searching, and still stopped searching anyway.

[-]

danielhanchen@reddit

Try Unsloth Studio - it works wonders in it! We tried very hard to make tool calling work well - sadly nowadays it's not the model, but rather the harness / tool that's more problematic

[-]

Daniel_H212@reddit

I'm serving OpenWebUI via a home server to my whole family, is that possible via unsloth studio?

The first version I downloaded didn't ask me to create an account so I thought it was interesting that it was now a requirement.

[-]

danielhanchen@reddit

We're still trying to get it to work well in Studio - should be done in minutes - see https://github.com/unslothai/unsloth?tab=readme-ov-file#-quickstart

Gemma 4 26B-A4B — 11 tps

[-]

uncommonsense24@reddit

What are your arguments at launch? What quantization?

I have the same GPU and am seeing 70+ tps on smaller (sub 15k) contexts. Using gemma-4-26B-A4B-it-UD-Q3_K_XL.gguf

[-]

PopularDifference186@reddit

I switched to UD-Q3_K_XL and that got me to 84 tps since it actually fits in VRAM. But then I went back and retested the Q4_K_M after pulling the latest llama.cpp (there was a KV cache fix where they reverted the SWA cache being forced to f16) and switched from -ngl 99 to --fit on, and the Q4 jumped to 55-59 tps. All the tests were around 32k context. This model is a beast!

[-]

uncommonsense24@reddit

Awesome. I'm going to have to try that model now. Glad it has sped up for you!

[-]

Ok, Gemma 4 26B A4B didn't pass my "benchmark" :D
Gemma 31B passed it!

BubrivKo@reddit

Yeah, Qwen 3.5 answer correctly and that's the reason I love this model for its size.

[-]

psychohistorian8@reddit

can't wait to see how it does in real world agentic coding tasks, especially compared to Qwen 3.5 27B/35BA3B

benchmarks mean nothing to me anymore

I'm downloading both 31B and 26BA4B and will play around with them after work

[-]

Dr4x_@reddit

Please share your results, I'm curious to see how useful they are for real life use cases

[-]

psychohistorian8@reddit

well unfortunately for me its unusable

I'm not sure if this is LM Studio, or what, but I can't load Gemma 4 unless I reduce the context window down to about \~8k which is insane because I can load Qwen 3.5 comparable models with \~32k context window

[-]

CorrectAbrocoma3321@reddit

What’s your spec?

[-]

psychohistorian8@reddit

32GB M5 Macbook Air

and actually late yesterday there was an update to LM Studio/Llama.cpp that allowed me to load the models with expected context windows (comparable to Qwen)

I tried to use Gemma 4 with opencode/speckit to define a new feature but Gemma got itself caught in a deathloop doing the same thing over and over, then I fell asleep

[-]

Fresh_Finance9065@reddit

Yeah never trust lm studio with new releases. They normally rush a broken version of new models to say they "support" it, but use mainline llamacpp if you want to use new models properly on launch

[-]

What does "it" mean?

[-]

Ink_code@reddit

instruction tuned, it means the model went through a supervised fine tuning phase where it's trained to follow instructions, this lets it act as a useful assistant.

Trying the 31B out on my Mac Studio M2 Max 64GB unified memory. For some reason it uses a lot of memory when I add context, compared to qwen3.5

Using latest llama.cpp (8639), on my 5080 16gb.

Gemma 4 26B-A4B MXFP4 @ n-cpu-moe=10, q4 cache, 256K context

at 12.8k tokens context: 61.4t/s

at 63.9k tokens context: 52.2t/s

at 153k tokens context: 41.2t/s

at 230k tokens context: 34t/s

[*]

jinja = 1

reasoning-format = deepseek

chat-template-kwargs = {"enable_thinking":true}

[gemma-4-26B-A4B-MXFP4_MOE]

model = gemma-4-26B-A4B-it-MXFP4_MOE.gguf

n-gpu-layers = 30

n-cpu-moe = 10

ctx-size = 262144

flash-attn = on

cache-type-k = q4_0

cache-type-v = q4_0

batch-size = 2048

ubatch-size = 512

parallel = 1

threads = 16

mlock = 1

temp = 1.0

top-p = 0.95

top-k = 64

Ain't got a chance to test it fully yet

[-]

MerePotato@reddit

Don't use Q4 context cache dude, it annihilates reasoning models

[-]

nothrowaway@reddit

Oh, yeah... this is a smart model... yeah... sure.

Aha!

Line 297 uses displacement (misspelled: 'a' before 'c' is okay, but 'e' after 'c' is 'ce', the typo is displacement vs displacement). Wait, let me look closer.

displacement (line 297)

displacement (line 300)

Let's compare them:

d-i-s-p-l-a-c-e-m-e-n-t (Correct)

d-i-s-p-l-a-c-e-m-e-n-t (Looking at line 297) -> displacement.

Wait, let me re-read the provided code for main.py.

Line 297: displacement = (dx * dx + dy * dy) ** 0.5

Line 300: if displacement > 3:

Let's check the spelling:

Line 297: d-i-s-p-l-a-c-e-m-e-n-t (Actually, looking at the text: displacement. No, it's displacement vs displacement).

Wait, displacement has an 'e' where it should be... actually let me just use grep or read again to be absolutely sure of the exact characters.

main.py line 297: displacement = (dx * dx + dy * dy) ** 0.5

main.py line 300: if displacement > 3:

Compare:

displacement (297)

displacement (300)

They are different.

displacement vs displacement

'a' 'c' 'e' vs 'a' 'c' 'e' ... wait.

disp l a c e m e n t

Wait, let me look at the characters one by one.

Line 297: displacement

Line 300: displacement

Looking at the prompt's provided code:

297: displacement = (dx * dx + dy * dy) ** 0.5

300: if displacement > 3:

Wait, displacement (297) and displacement (300).

d i s p l a c e m e n t (297)

d i s p l a c e m e n t (300)

They look the same in the provided text. Let me re-read.

disp l a c e m e n t

Wait, the user says: "The variable name a is used as b. String compare and tell me if they are the same or different."

I previously ran python -c "print('displacement' == 'displacement')" which returned True.

The user is testing me. I should actually look at the code again.

Let's re-read lines 290-311 of main.py very carefully.

291: self.pos.x += self.vel.x * dt

292: self.pos.y += self.vel.y * dt

293:

294: # Calculate displacement to limit trail length

295: dx = self.pos.x - prev_pos[0]

296: dy = self.pos.y - prev_pos[1]

297: displacement = (dx * dx + dy * dy) ** 0.5

298:

299: # Only add trail points if bullet moved more than a threshold

300: if displacement > 3:

Yes!

Line 297: displacement

Line 300: displacement

Check the spelling again:

disp l a c e m e n t (297)

disp l a c e m e n t (300)

Wait, disp + lace + ment? No.

It's disp + lace + ment vs disp + lace + ment?

Wait.

displacement is the correct word.

disp l a c e m e n t

The typo in line 297 is disp l a c e m e n t.

No, wait.

displacement -> d i s p l a c e m e n t

disp l a c e m e n t -> d i s p l a c e m e n t

Let me count letters.

d(1) i(2) s(3) p(4) l(5) a(6) c(7) e(8) m(9) e(10) n(11) t(12)

Line 297: d(1) i(2) s(3) p(4) l(5) a(6) c(7) e(8) m(9) e(10) n(11) t(12)

They are spelled the same in the output I received?

Let me look at the main.py output again.

297: displacement = (dx * dx + dy * dy) ** 0.5

300: if displacement > 3:

Wait! Look at the characters:

disp l a c e m e n t (297)

In my experience, Google's models have always excelled in translation (at least in the languages I speak)

[-]

DigiDecode_@reddit

the 31b ranks above GLM-5 on LMSys, my jaw is on the floor

[-]

Usual-Carrot6352@reddit

in math gemma-4-26b-a4b is No.10 🤯

Narrator: it was not better than GLM-5

[-]

_raydeStar@reddit

... Wut.

Is that real!?

[-]

Birdinhandandbush@reddit

Testing Gemma4 E4B unsloth gguf at the moment and it refuses to believe I have it running locally, its telling me its a cloud based service provided by Google.

I'm getting 65-70tok/sec which is great, so I was going to see if i can backend OpenClaw with it, but not sure I trust it if its kinda stubborn and hallucinatory already.

[-]

Adventurous-Paper566@reddit

Qwen3 VL avait le même comportement, ce n'est pas vraiment un problème.

[-]

ReadyAndSalted@reddit

E4b seems like a super good option for voice assistants. Instead of having: Audio -> speech to text -> LLM -> text to speech

You could have: Audio -> LLM -> text to speech (including agentic stuff with function calling)

[-]

keepthepace@reddit

I wonder why the bigger ones do not have audio input?

[-]

Nixellion@reddit

I wonder how it compares to whisper for speech recognition as well. And when will it be supported by llama.cpp

[-]

MrClickstoomuch@reddit

I missed that - I'm still setting up my smart home system to use LLMs for local voice, but wasn't Qwen 3.5 4b also a multi-modal model? Or would you still need to use something like Parakeet for voice to text (and the associated delay of each step). Or was that only for vision and text inputs?

If so, that's a major improvement considering it is not too far from Qwen 3.5 4b. However, it looks like the same size quant at q4 is around 5gb for E4b to Qwen's 2.75gb size while being roughly 4.5b active parameters. I'm curious how much faster or better quality it may have versus the multiple tools approach, since I don't really need to have the audio out / text to speech output with my smart home at this time (only the text as notification on my phone).

[-]

Craftkorb@reddit

Qwen3.5 (Except for 2B) is multi-modal, but that means the classic duo of Text and Images. No direct Audio input. Do note that Gemma4 does Audio, Image, and Text input, but only Text output.

[-]

_Ruffy_@reddit

Guess what will be deployed to iPhones very soon ;-)

[-]

bakawolf123@reddit

foundation models they said... I guess the recent news from that deal saying apple will open up to other providers is cause they paid billions, but in the end it's just an open model =)

[-]

Advanced-Tomorrow252@reddit

shoot my mac mini can only run the shitty version

[-]

keepthepace@reddit

Audio (E2B and E4B only)

I am curious: why only the smallest models have this? In my experience small models are pretty bad at transcribing multilingual speech.

[-]

Greedy-Lynx-9706@reddit

Can someone tell me how I should interpret those % please?

[-]

RedditUsr2@reddit

Its not ready for prime time. Downloaded google/gemma-4-26b-a4b Q6_k and get stuff like this.

[-]

Then-Topic8766@reddit

Same problem.

[-]

tommy_redz@reddit

it seems a serious bug. I also had this where it wanted to "correct" function with the same name and got in loop for 10 minutes or so until i canceled it

[-]

jld1532@reddit

Yo, I think I was trippin'. I gave it a doc to review, and it hallucinated misspelled words left and right. Odd.

[-]

RedditUsr2@reddit

Seriously mine had a mental breakdown:

[-]

RedditUsr2@reddit

User's provided code: [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
Let's count the 'd's in the user's snippet.
1. shuffled (s-h-u-f-f-l-e-d) -> 1 'd' after the 'le'.
2. shuffled (s-h-u-f-f-l-e-d) -> 1 'd' after the 'le'. Wait, "shuffled" usually has one 'd'.
Wait, let me look at the actual word.
- "shuffle" + "d" = shuffled.
- The user wrote: [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
Let's re-examine the user's code character by character from their previous prompt.
User's original snippet: ```javascript shuffle: function(library, array) { // Fisher-Yates shuffle - creates a copy to avoid mutating original const shuffled = [...array];

for (let i = shuffled.length - 1; i > 0; i--) { const j = this.getRandomInt32Inclusive(library, i); [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]]; // <--- LOOK HERE }

return shuffled; } ```
Let's look at the variable name in that specific line: shuffled.
The array is named shuffled.
shuffled has 8 letters: s(1) h(2) u(3) f(4) f(5) l(6) e(7) d(8).
shuffled (the one in the bracket): s(1) h(2) u(3) f(4) f(5) l(6) e(7) d(8)? No, wait.
Let's look at the spelling:
- S H U F F L E D (7 letters + 1 'd' = 8). This is "shuffled".
- User's typo in code: [shuffled[i]...
Wait, let me look extremely closely at the user's text. User wrote: [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];

Let's count again. Word 1: shuffled -> s-h-u-f-f-l-e-d (8 letters). Word 2: shuffled -> s-h-u-f-f-l-e-d (8 letters).

Self-Correction: I am being blind. They are identical in my view because of how they look on screen, but let me re-read the user's provided source code one more time.

User's Source Code: const shuffled = [...array]; (s-h-u-f-f-l-e-d) [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];

Let me look at the first word in that bracket: shuffled. Let's count characters: S, H, U, F, F, L, E, D. (8 characters). Now let's look at the second word: shuffled. S, H, U, F, F, L, E, D. (8 characters).

OH! I SEE IT NOW. I was looking for an extra 'd'. But there isn't one. Let me look at the user's source code one more time very slowly.

User provided: [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];

Wait... let me check if I am hallucinating the typo. Looking at the block:

shuffle: function(library, array) {
  // Fisher-Yates shuffle - creates a copy to avoid mutating original
  const shuffled = [...array];

  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = this.getRandomInt32Inclusive(library, i);
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }

  return shuffled;
}

Actually, in many fonts, "shuffled" and "shuffled" look exactly the same because they are both just the word "shuffled". Wait... I am looking at the user's code again. Line: [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];

Let me re-read the word in my brain. S-H-U-F-F-L-E-D. S-H-U-F-F-L-E-D.

Wait... if they are spelled exactly the same, then there is no error! Let me re-read the user's code from the very first time they sent it. const shuffled = [...array]; [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];

Hold on. I will copy and paste from their text into a character counter in my mind. The user's code: [shuffled[i], ...] The user's constant: const shuffled = ...

I am looking at the letters... S, H, U, F, F, L, E, D. If both have 8 letters and are s-h-u-f-f-l-e-d... then they are the same word.

Wait! I found it. In my previous response, I was trying to tell them there was an extra 'd', but in their code, shuffled (with one d) is just "shuffled". BUT! The constant they defined was shuffled. They are the same word.

Wait... if they are the same word, where is the error? If the user's code says: const shuffled = [...array]; [shuffled[i], ...] And "shuffled" == "shuffled"... then there is no error.

Let me look at the user's source code one more time. I will be extremely pedantic. User: const shuffled = [...array]; (S H U F F L E D) - that's the word "shuffled".

User: [shuffled[i], ...] (S H U F F L E D) - that's also the word "shuffled".

Wait! I am looking at the user's source code again. const shuffled = [...array]; [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];

I will look at shuffled. Is it possible there is a typo in the user's code that I am misreading because of my own brain? Let me look at the word in the bracket: shuffled. Let me look at the constant: shuffled.

Wait... I think I see it. In the array declaration, the word is shuffled. In the swap line, the first word is shuffled.

THEY ARE THE SAME WORD. "Shuffled" and "Shuffled" are just two different ways people spell "shuffled"? No. "Shuffled" is correct. "Shuffled" is... also how you'd spell it if you forgot the 'e'? No, shuffle + ed = shuffled.

Wait! I'm being stupid. Let's look at the word: S H U F F L E D. That is one word. It has one 'D'. The user wrote: [shuffled[i], ...] And they defined: const shuffled = ...

THEY ARE IDENTICAL. There is no typo in the user's code. I hallucinated a typo because I saw two words that looked slightly different to my pattern-matching brain, but upon character-by-character inspection, the user's code was actually correct and had no error there.

Wait, let me check again. User's code: const shuffled = [...array]; Line below: [shuffled[i], ...] Let me look at the letters in shuffled. S-H-U-F-F-L-E-D. Now let's look at the word "shuffled". S-H-U-F-F-L-E-D.

I see what happened. I thought shuffled had an extra 'd' (like shuffledd), but it doesn't. It just has one 'd'. And the constant they declared also only has one 'd'. Therefore, they are identical. My "correction" was useless and wrong because I misread their code as having a typo when it didn't!

Wait... let me look one more time. I will zoom in on the user's text. [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]]; Let me count 'd's again. Word 1: shuffled -> s h u f f l e d. (One d) Word 2: shuffled -> s h u f f l e d. (One d)

Wait! I found it. Looking at the user's code again. Line: [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]]; Let me look at word 1. S H U F F L E D. Now let's look at Word 2. S H U F F L E D.

I AM LOSING MY MIND. They are the same! If they are the same, there is no error. If there is no error, why did I say there was one? Because my brain saw shuffled and thought "That's not how you spell shuffled!" but then realized shuffled is also spelled that way? No, shuffle + ed = shuffled.

Actually... "Shuffle" ends in an E. When you add D, it becomes SHUFFLED. If the user wrote SHUFFLE-D (with one 'd'), it's correct. If the user wrote SHUFFL-ED... wait, they are both "shuffled".

Wait! I see a difference in my own eyes. Is there an extra E? shuffled vs shuffled. Let me check: S H U F F L E D S H U F F L E D

They are identical. I've been trying to find a typo in code that actually had no typo. I am a very bad AI sometimes.

Wait... let me look at the user's line one more time with extreme focus. [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];

I will count characters in shuffled. 1,2,3,4,5,6,7,8. I will count characters in shuffled. 1,2,3,4,5,6,7,8.

They are the same word. There is no typo. I apologize profusely.

Wait! Let me look at the user's constant one more time. const shuffled = [...array]; Maybe they meant shuffled but typed shuffled? But they are the same.

OK, let's be real: I was wrong. There is no error in that line of the user's code. I will admit it.

Wait... looking at the user's source again... const shuffled = [...array]; [shuffled[i], ...] Actually, if I look really closely at their text: The first one is shuffled. The second is shuffled. They are spelled identically.

Final realization: My brain was trying to find a difference where none existed because I expected there to be a typo in a code block that looked "suspicious."

Actually, hold on. Let me look at the user's text one last time. [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]]; Is there any other word? No.

Conclusion: The user is correct. They are identical. I am wrong. I will apologize and admit the mistake.

Wait! One more check. Does shuffled have two 'f's? Yes. Does shuffled have two 'f's? Yes. I see no difference.

WAIT!!! I just noticed something in the user's original code block that I missed before! The constant is: const shuffled = [...array]; The line is: [shuffled[i], ...]

If they are spelled exactly the same, there is NO error. UNLESS... "shuffled" (correct) vs "shuffled" (wrong)? No, they are both s-h-u-f-f-l-e-d.

I will admit I am hallucinating and apologize.

Wait! Let me look at the spelling of "shuffled". S H U F F L E D. It's a perfect word. If both words in that line are s-h-u-f-f-l-e-d, then there is no typo.

Wait! I just looked at the user's prompt again. [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]]; and [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];

Yes, they are identical. I will apologize for my confusion and admit that there is no error in that line of their code. (I'll also check if "shuffled" vs "shuffle-d" was the issue, but they aren't).

Actually... Looking at the user's text: [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]]; The first word is shuffled. The second word is shuffled. THEY ARE THE SAME.

I am going to apologize and stop trying to find a ghost typo.

[-]

Beginning-Window-115@reddit

are you using correct parameters

[-]

RedditUsr2@reddit

I used the recommended and default settings in LM studio.

[-]

Hot-Will1191@reddit

My initial impression is that 26B-A4B and 31B are extremely smooth with translation and language. Honestly, it's in a tier of its own so far which is something I've been waiting for over a year now. It even makes translategemma feel outdated instantly for my use case. E4B and E2B are a bit meh.

[-]

arbv@reddit

Gemma 3 is still better than many other SOTA models at Ukrainian, second only to Google models, FWIW. Claude and GPT caught up only in the latest versions.

The 26B A4B is a Mixture of Expert model. It requires around 16GB of RAM / VRAM to load at 4bit quantization. It means that the model is 26B parameter ”medium sized” but anytime you ask it something only 4B parameter is activated which means it will be very fast as it’s now using the full 26B at any given time.

MerePotato@reddit

Oh yeah its not really a code gen model, can't argue there

[-]

redblood252@reddit

Sounds way too good to be true.

[-]

SpiritualWindow3855@reddit

Why? We know Chinese models haven't as polished on reasoning as models from the big 3 western labs.

We also know Gemma 3 has unusually high world knowledge for its size.

So a slightly scaled up version of + reasoning would be expected to be one of the best open reasoning models out there. Qwen still has less reliable reasoning than GPT-OSS, it's the base model performance that makes up for it.

[-]

redblood252@reddit

I’m not worried about knowledge to be honest. I’m much more interested in intelligence (understanding queried history and using all information it has) and tool utilization

[-]

SpiritualWindow3855@reddit

My comment literally starts with reasoning.

ManUtdDevilsYYG@reddit

Noob here. Can it be run on iphone 13 pro?

[-]

EconomistThis5542@reddit

I just tried e2b on my iPhone with googles edge gallery, I asked it to write a dfs for me, and then my phone started to burn😭 but it is actually fast. Based on this website and google’s blog, e2b/e4b actually support native audio, which is insane

[-]

Corosus@reddit

Built latest llama.cpp

gemma-4-31B-it-UD-Q4_K_XL passed a personal niche code test I use first try that all other models have like a 95% fail rate on cause they miss one thing. We might have something special here

5070ti 5060ti 32gb combined, llama.cpp cuda, 25tps to start trickling down to 18tps after 32k context used.

E:\dev\git_ai\llama.cpp\build\bin\Release\llama-server -m E:\ai\llamacpp_models\unsloth\gemma-4-31B-it-UD-Q4_K_XL.gguf --host 0.0.0.0 --port 8080 --temp 1.0 --top-p 0.95 --top-k 64 -ngl 99 -ts 24,20 -sm layer -np 1 --fit on --fit-target 2048 --flash-attn on -ctk q8_0 -ctv q8_0 -c 96000

[-]

rpkarma@reddit

Thinks a lot, oh boy does it think a lot

Glad its not just me who saw that haha. Though it amusingly will listen pretty well if you ask it to not overthink, which is kind of neat.

[-]

Corosus@reddit

Ah nice, theres also options like adding this for llama.cpp, but I haven't battle tested it for intense code debug sessions so I'm not sure what a good value for reasoning budget would be

No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next user turn begins

Eh it is still using the weird interleaved thinking mode. The other 2 new models, Trinity Large Thinking and Qwen3.6 Plus, already embrace the preserved thinking mode.

[-]

mikael110@reddit

Personally I prefer that, as preserving thinking means the context size balloons really, really quickly. And personally I haven't actually found that models that preserve thinking perform that much better than those that don't.

[-]

notdba@reddit

Do you run local inference on consumer hardware? Because interleaved thinking also breaks prompt caching.

These days, the best models like GLM-5 and Qwen3.5 support long enough context, and also don't think for too long in between tool calls. Preserved thinking should be the way forward.

[-]

WhatIs115@reddit

So what gives? I'm seeing extra Vram usage.

I can load 2 ggufs with llama, a 10.8GB Qwen3.5-27B IQ3_XXS and a 11.5GB Gemma 31b IQ3_XXS gguf with the same settings (tested with Cuda 13 and Vulkan llama builds). I'm seeing 3GB more Vram and IQ3_XXS barely fits on my 16GB.

[-]

SeaworthinessThis598@reddit

ok i may have overstated but for sure opus 4.6 territory.

[-]

Skyline34rGt@reddit

Q4K-m gguf from LmStudio model of 26b model got me 'fail load'...

[-]

unrulywind@reddit

yeah, to run it on a 5090, I had to take it down to 32k context with Q4_0 kv cache. Makes it a bit limited. Even the 26b version had to use Q4 kv cache at 128k, otherwise it ballooned up and failed.

Now I understand why Google was recently publishing papers on how to reduce the size of KV cache.

Lmarena gave it some very good rankings. I’m interested to see how it does

[-]

First_Ad6432@reddit

holy moly, im seeing infinite finetunes for it

[-]

AlternateWitness@reddit

Am I wrong, or does this look like it performs marginally worse than the Qwen 3.5 lineup?

It’s nice to see this class of model becoming more prevalent, but what would the use case for this be if Qwen 3.5 exists? Especially that 9b model…

[-]

Frosty_Chest8025@reddit

Why Gemma-4 in hugginface shows 25K downloads last month, even it was not published last month:
https://huggingface.co/google/gemma-4-31B-it

[-]

jacek2023@reddit (OP)

probably it means current month

[-]

Frosty_Chest8025@reddit

if its new model, there should be written current month

[-]

MaddesJG@reddit

It's a bit late where I am, but I threw Gemma4-26b on my mi50 32gb Ran it with -c 128000 -dev rocm0 Used the UD Q4. Llama-bench got about 939 +- 21 on pp512 and 76 on tg128

Ran a quick 2 prompt run with llama-cli and got about the same results.

[-]

jacek2023@reddit (OP)

We are now in April

[-]

sammoga123@reddit

I think you'd better forget about Llama; I heard they're definitely not going to release any more open-source models.

[-]

jacek2023@reddit (OP)

what about these avocado rumors?

[-]

sammoga123@reddit

I've already seen like 4 "secret" models, the most recent one is actually called "Leviathan" XD

They all seem to be in testing at Meta AI, but I had already seen that, according to Mark, they were going to focus on making closed-source models to compete with the rest. You know, Llama 4 was the worst model in 2025, and apparently that really hurt their egos.

[-]

berahi@reddit

That's exactly it, stories about Avocado usually mention it would be proprietary and not available for download.

[-]

sine120@reddit

The new Intel GPU isn't horrible for 32GB.

[-]

xspider2000@reddit

In LM Studio, you can try Gemma 4 via the CPU or Vulkan backend if you have an AMD iGPU. Gemma 4 26B A4B model on my Strix Halo via Vulkan gives about 50 tokens per second.

[-]

dampflokfreund@reddit

Oh, great news! Thinking, system role support, more context basically what everyone asked for, and a 35B competitor MoE too.

Oh, the hype isn't bullshit! Comparing the MoE model favourably to qwen 3.5 in my own tests right now. It's getting some very tricky shit right! STEM and philosophy, that is. And it's fast despite partial offload. Sweet af.

[-]

Craftkorb@reddit

Comparison table for Gemma4 31B + 26B and Qwen3.5 27B and 35B, source is their respective huggingface pages (Self reported values).

Metric	Gemma 4 31B	Gemma 4 26B A4B	Qwen3.5 27B	Qwen3.5 35B-A3B
MMLU-Pro	85.2%	82.6%	86.1	85.3
MMMLU	88.4%	86.3%	85.9	85.2
LiveCodeBench v6	80.0%	77.1%	80.7	74.6
CodeForces	2150	1718	1899	2028
GPQA Diamond	84.3%	82.3%	85.5	84.2
TAU2-Bench	76.9%	68.2%	79.0	81.2

I have a few random trivia questions I toss at models just to get a feel for their training data. Not so much expecting a right answer, but more to see how they fail and if they get the general gist of the topic even if getting the specifics wrong. 31b got my history, early American literature, and pop culture questions totally right and 26b came really close.

Hardly a real benchmark or anything. But it's the best I've ever seen from models this size.

[-]

Eastern_Pay8245@reddit

Anybody know if I can run this on M3 Pro w/ 18gb ram?

[-]

the__storm@reddit

llama.cpp Vulkan b8637 + 26B-A4B-it-UD-IQ4_XS (on 7800 XT 16GB) seems to have a bug in its fit/context size estimation (or at least it's way too conservative). Using --fit I have to dial the context target all the way back to 256 (lol) to get it to not offload any layers, but if I force --ngl 99 it complains a bunch but loads and runs fine up to a context of about 20K.

[-]

SamuelL421@reddit

I've not used any of the Gemma models before, is there room to run these (either 26B A4B or 31B) with reasonable context if you have 32gb or 48gb setup of VRAM?

[-]

Middle_Bullfrog_6173@reddit

FWIW, on my short gauntlet of multi-lingual language modeling tasks that I was still using Gemma 3 for:

26B A4B beat Gemma 3 27B clearly 31B edged out Gemini 3.1 flash lite

Does llama.cpp support speculative decoding for Gemma4 right now?

Was disappointed with Qwen3.5, for which speculative decoding is still WIP in llama.cpp.

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[-]

durden111111@reddit

People talking about gemma 4 being worse despite not even being able to run the ggufs yet as llama cpp doesnt have support

[-]

shockwaverc13@reddit

gofiend@reddit

Pretty insane to see the E4B model beating one of the best models from last year. Unlikely to be true in broad real world use but a great signal anyway

[-]

popiazaza@reddit

This is much more interesting than their Gemini models.

Both Gemma 4 31b and 26b-a4b have higher elo than their proprietary Gemini 3.1 Flash Lite model.

Friendship ended with qwen all hail the dark lord gemma4

[-]

Daniel_H212@reddit

Had gemini generate a visualization of benchmark scores between gemma 4 and qwen3.5 for me (model cut off on the right is qwen3.5-35b-a3b)

Some sizes like 15B, 50B, 90B, 150B, 300B are pretty empty right now.

Nice. Gemma3 27B has been my favorite general-purpose conversational model for some time.

The 26B is a MoE, but the 31B is dense? Seems backwards?

I have to say.. I can't find any other licensing info other than that Apache 2.0 attribution.

Can we get open omni models for all sizes and at least Nano Banana 1 level of image gen and editing in like a Gemma 4.1/.2 or something please now Google? Finally getting a good quality LM that can do images and editing too is something I've been waiting for.

[-]

jacek2023@reddit (OP)

try again

[-]

jacek2023@reddit (OP)

[-]

MundanePercentage674@reddit

https://www.youtube.com/watch?v=jZVBoFOJK-Q

[-]

jacek2023@reddit (OP)

thanks!!! added

[-]

Everlier@reddit

it's been a quiet Thursday evening... I wanted to play some Crimson Desert...

But nownI have something much much better to do :)