One year ago DeepSeek R1 was 25 times bigger than Gemma 4

[-]

FoxiPanda@reddit

I am deeply impressed with Gemma4-26B-A4B-IT (Q4_K_L GGUF from Unsloth). I'm primarily using it for historical document transcription / handwriting deciphering from the late 1700s through the early 1900s and it is better than a lot of Frontier models for that task. Only Opus 4.6 and Gemini 3 really compare - it destroys GPT5.4 at handwritten transcribing and is generally better than Sonnet 4.5/4.6 too.

[-]

cunasmoker69420@reddit

Have you tried any of the Qwen3.5 models for handwriting deciphering? Does Gemma beat those as well?

[-]

FoxiPanda@reddit

Yes - see my more expanded post in this same thread: https://old.reddit.com/r/LocalLLaMA/comments/1scovb5/one_year_ago_deepseek_r1_was_25_times_bigger_than/oeh3bx6/

[-]

-dev4pgh-@reddit

This is very interesting to me, as I have found Qwen3.5-9B to be the best specifically for a range of (sometimes old) handwriting. Earlier Qwen vision models also beat others at the time. But I have not tested any of the new Gemma 4. Have you ever used Qwen3.5 for similar tasks, and determined that Gemma 4 is now even better? If so, that is awesome news.

[-]

FoxiPanda@reddit

So let me describe the workflow and perhaps I can tell you about how I go about it and what settings I use and what I like about Qwen3.5 and what I like about Gemma4.

Models:

Qwen3.5-35B-A3B unsloth Q8_K_XL variant
Gemma4-26B-A4B-IT unsloth Q4_K_XL variant
One might ask "why different variants"...don't ask this. It sorta comes down to speed, but it also sorta comes down to "this is what I downloaded first and it worked so well that I stuck with it"...

Qwen3.5 Launch settings:

/opt/homebrew/bin/llama-server \
  --model ~/models/Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf \
  --mmproj ~/models/Qwen3.5-35B-A3B-mmproj-F32.gguf \
  --ctx-size 131072 \
  --n-gpu-layers 999 \
  --threads 16 \
  --parallel 1 \
  --batch-size 1024 \
  --ubatch-size 1024 \
  --cache-type-k bf16 \
  --cache-type-v bf16 \
  --flash-attn on \
  --jinja \
  --temp 0.7 \
  --top-k 20 \
  --top-p 0.95 \
  --metrics \
  --mlock \
  --host 0.0.0.0 \
  --port 8092

Gemma4 Launch Settings:

/opt/homebrew/bin/llama-server \
  --model ~/models/gemma-4-26B-A4B-it-UD-Q5_K_XL.gguf \
  --mmproj ~/models/mmproj-gemma-4-26B-A4B-it-F32.gguf \
  --port 8081 \
  --ctx-size 262144 \
  --n-gpu-layers 999 \
  --threads 24 \
  --threads-batch 24 \
  --flash-attn on \
  --cache-type-k bf16 \
  --cache-type-v bf16 \
  --temperature 1.0 \
  --top-p 0.95 \
  --top-k 40 \
  --min-p 0.01 \
  --mlock

One might also note "Hey, those aren't apples to apples settings!" and you'd be right...different models get different settings. This is where I put on my deal with it sunglasses. This is what I've found works best so far for me/us.

So the workflow is to paste in some heinously terrible handwritten document that looks similar to something like https://mtv-drupal-assets.s3.amazonaws.com/files/resources/chastellux-letter-vine-12.jpg but usually far worse quality / ink spills / smudges / faded writing / etc (note: that one is a modestly recognizable one from George Washington's Mt. Vernon collection and has nothing to do with me personally).

I use this prompt typically as my basic test prompt:

Can you tell me what this image is and if it contains text, can you please take some time and try to transcribe it? Please add (?) where you are unsure/uncertain about a piece of text. Don't get stuck though if it's indecipherable - just mark it as (?) and move on.

Workflow (once the model is up and warm and all that):

Paste in picture of terrible ~~cursed~~ cursive handwriting that someone committed to paper 100-300+ years ago.
Paste in the above prompt.
See what magic (or not) comes out.

Here's what I like about Qwen3.5-35B-A3B:

It's a thinker. It will analyze every single line, character, ponder, go back and look again if it finds the same word or similar words but realizes it made a different transcription previously and then it will compare the two, decide if they're the same or not and then after some indeterminate amount of time (usually like 60-100 seconds), it will pop out an answer. I've watched it reason in real time and on my hardware (M3 Ultra Mac Studio 512GB w/ 80 GPU cores), it gets like 65tok/s and can output thousands of tokens in the thinking blocks before it pops out a final answer. That answer is usually pretty good (usually like 2-5 words incorrect on a 100 word document). However...it almost never uses the (?) uncertain tags. It's a confident sometimes incorrect model that kind of just ignores that part of the prompt.

Here's what I like about Gemma4-26B-A4B-IT:

It's fast and thinks substantially less, but comes up with the same answer as Opus like 95%+ of the time. It still reasons quite a bit - but way less than Qwen3.5. Qwen will spend 70 seconds pondering over the letter T and Gemma won't. Gemma is happy to mark it with a (?) and let a human figure out whether that's a T or an I. This is good behavior in my opinion. It also will spit out an answer in 20-30 seconds instead of 70-100 seconds...and it's usually just as good or better than Qwen's answer. Gemma does not ignore the prompt and does the same quality (+/- a small percentage) as Qwen in far fewer tokens and far less time. 25 seconds vs 80 seconds is kind of a big deal for me especially if the output is about the same quality.

[-]

SSOMGDSJD@reddit

In my use of these models, qwen3.5 tendency to think and think think and think some more has been the bane of my existence. I like to try to compare ranges of models for different things and they always require special handling.

The Gemma 4s like you said, think briefly, execute well. It's a breath of fresh air tbh

[-]

FoxiPanda@reddit

Lol yeah Qwen thinks A LOT. SO MUCH. I appreciate Gemma4's capability without the intense lag because of the thinking.

[-]

-dev4pgh-@reddit

This is incredibly helpful and informative. Thank you so much for writing all this up and sharing it!

[-]

FoxiPanda@reddit

Sure no problem. FYI I have read this morning that all the GGUF files for Gemma4 will be regenerated today so if you already have them downloaded from Unsloth/bartowski/wherever else... go check for new ones...apparently several fairly important fixes.

[-]

LostRequirement4828@reddit

Try apex 1 compacted gguf one, I find it miles faster and no lose in quality at all for me specifically, it actually seems better than unlosth ud ones, if you want to be sure you dont lose any quality but still improve your speed, you can go apex 1 balanced but trust me, compacted does the job very very well

[-]

rinaldo23@reddit (OP)

Impressive usecase

[-]

FoxiPanda@reddit

Thanks, it’s primarily for genealogy purposes. My family does a lot of it (even though at a personal level I’m not that interested). I do try to help them with the latest state of the art tools though and Gemma4 is very much getting integrated into our workflows now. It’s genuinely great for this use case.

[-]

CatalyticDragon@reddit

It really is impressive and helps illustrate just how green this field is.

You can generally chart the maturity of a technology by the cost of improvement and by that metric we are far from "AI" being mature.

[-]

Glazedoats@reddit

:D

[-]

Immediate-Word1958@reddit

What's been wild to watch from China is how fast the ecosystem grew around DeepSeek in just one year. A year ago most Chinese devs were still defaulting to GPT-4 via workarounds. Now DeepSeek and Qwen are the go-to for most local projects.

The pricing played a huge role — DeepSeek V3 API is roughly 8-10x cheaper than GPT-4o per token, and for everyday coding and bilingual tasks, the quality gap has basically closed. That kind of price difference changes behavior fast.

The interesting part is that Qwen, GLM, and others are all pushing hard too. Competition here is intense in a way that directly benefits devs. Every few weeks there's a new release trying to one-up the last.

[-]

Technical-Earth-3254@reddit

Size never really scaled with potential. Otherwise Kimi K2.5 would have to be 5 times better than Step 3.5 flash, which it isn't (at least for what I've tried them).

The best strategy is to run the smallest model possible that does whatever job you need it to do. If that's Gemma 4, that's cool. If it's the newest DeepSeek model, then it's fine too.

[-]

saig22@reddit

OpenAI realizing that by going bigger you could perform zero shot and few shots on various tasks with language models is exactly what started the LLM era. Saying size never scaled with potential is simply wrong. We would still train 20 million parameters specialized BERT-based models if we did not realize that performance scale with size. The entire history of Deep Learning research is filled on bigger is better and the need to invent new architecture (such as resnet) and regularizations so we can train always bigger models. It is a good thing that we have smaller LLM beating bigger LLM from a year ago, but scale will still be a factor. I fully expect Gemini 4 to be a huge (1+T) model that will blow gemma 4 out of the water.

[-]

twack3r@reddit

Do we have any indication how large current closed models are? Opus 4.6, GPT5.4, Gemini 3.1?

Given GLM 5 and 5.1 are 1.5T+, I would expect the frontier models at at least factor 3.

[-]

Zeeplankton@reddit

I don't think they are much larger.

OpenAI inference is really, really, fast. I simply cannot imagine them serving a 1.5T+ model to 200 million users daily. They might've tried with 4.5 - which had big rate limits, was really expensive via API, and quickly killed.

I think basing it off that, and assuming openAI has no particular moat, sonnet and opus are probably a decent bit larger because they can be.

Gemini could be stupid large and it doesn't matter since it's google TPU.

[-]

BlueSwordM@reddit

I wouldn't be surprised if all closed LLMs are massive multi TB and even deca-TB models, but with high sparsity.

Would certainly make a lot of sense.

[-]

StupidScaredSquirrel@reddit

Personally I'm not entirely sure gpt 5.x is a single model. I have such different latency and behaviour depending on the field and task that I wonder if it's just a bunch of models in a Trenchcoat.

[-]

SalariedSlave@reddit

GLM 5 is 754B, not 1T, no? Not sure about 5.1.

I think Kimi K2.5 is still the largest open model at 1T, the other open frontier models hover under 1T at this time.

[-]

twack3r@reddit

You are correct, I conflated parameter count between GLM 5/5.1 and Kimi K2.5

[-]

Rangizingo@reddit

I can’t believe I’m defending OpenAi but there was a point where bigger meant better and is was a demonstrable thing but they assumed it was infinite. There’s only so much data

[-]

Few-Equivalent8261@reddit

There's new data every day

[-]

portmanteaudition@reddit

But it's not stationary. Over time, parameters change essentially.

[-]

dw82@reddit

What resources are available to help identify the smallest model that performs a particular task?

[-]

exaknight21@reddit

I recently tried PrismML’s 1 bit 8B at 64K context and was literally blown away at the knowledge and coherency. Zero hallucinations in my tests and it felt like I was actually speaking to a Qwen3:4B model.

The future is seriously bright. PrismML isn’t getting the love they deserve. The scalability is at 1 bit, not FP16 or FP8.

Time will tell friends. I’m excited.

[-]

megakilo13@reddit

It hallucinates big. Can’t even pass a simple ask of “who is Kobe Bryant”

[-]

exaknight21@reddit

My tests were RAG based. It did good. But i agree, general knowledge is off.

[-]

layer4down@reddit

OTOH if it’s marketed to the enterprise then general knowledge is not all that important unless you need very general purpose chatbot for some reason. I think we’re overly inclined to treat most use cases as such.

[-]

TylerDurdenFan@reddit

I wonder how hard is it for PrismML to apply their proprietary quantization methods for larger models (and for MoE in particular). Bonsai blew my mind, and while I believe much larger models at 1 bit would pack a bigger punch, I fear they may be much harder to quantize to 1 bit

[-]

Antique-Bus-7787@reddit

I had the assumption that bigger models handled quantization much better than small models ? Wouldn’t this be the case for this new method too ?

[-]

TylerDurdenFan@reddit

Absolutely.

What I'm talking about is PrismML's proprietary, "math based" mechanism for creating the quantized versions. What if that becomes exponentially expensive with model size?

I can only speculate since they don't open source that part.

[-]

exaknight21@reddit

You know, hard is subjective. The trade off is an extremely efficient model.

It’s beyond my comprehension why everyone isn’t trying these models.

I created a post yesterday on how to run it on an Mi50 32GB with llama.cpp - mind you, this is a very outdated GPU and it performed at 100+ tps for 1.7b and 87 tps for 8B model at 64K context. All within the GPU no offloading. Imagine scalability of this thing.

I had it generate a legal contract for comparison, 1.7B had some hallucinations, but was on part with a 3B Llama 3, dare is say, Llama 3 is actually dumber.. then I tried the 4B, straight up the same exact thing as qwen3:4b, then i tried the 8B, and it legitimately blew me away. It’s at chatgpt level (whatever is available at chatgpt.com) and I am not saying it’s better or whatnot, but the fact that I could have it generate a proposal in detail, it got my attention.

I’m now testing my RAG iteration with it to see how good it can actually perform.

[-]

joost00719@reddit

Is it just me or is gemma4 26b MoE just bad? It calls tools with wrong parameters, get stuck in a loop cuz the tool says the parameters aren't right. It edits json files and ends up with invalid syntax...

I've tried openclaw and opencode, both without much luck.

Qwen3.5 35b MoE is so much better in any way for me.

[-]

markole@reddit

Make sure to use the latest llama.cpp and to redownload the weights.

[-]

joost00719@reddit

I was using the latest yesterday. I did use Llama-server through Llama-swap. Vibe coded a script that automatically pulls the latest and greatest.

I'll re-download the model and see if that changes anything. Thanks for the tip

[-]

thejoyofcraig@reddit

There were some issues that have only just been resolved in Gemma 4's tool calling (was broken at least in mlx-vlm until recently). So you might update your binaries and try again.

[-]

rinaldo23@reddit (OP)

This happened to me as well, I'm currently testing Qwen3.5-9B-Q5_K_M with opencode

[-]

yes_yes_no_repeat@reddit

I am using Gemma for Vision and point and click. The spatial is very good A4B at Q4 E4B Q8 is not useful for my case scenario.

Here my bot built on PI-mono self benchmarked itself:

Correction after Linux-side inspection and rerun with reasoning enabled

Linux host findings - The Gemma multimodal setup is not obviously broken - Both Gemma profiles already include mmproj - This llama-server build supports reasoning only as on/off/auto, not a native high mode

Same attached screenshot, rerun with reasoning ON

E4B - Prompt: destination cell click Expected: around C4 Result: C3 - Prompt: destination cell center Expected: around C4 Result: C2 - Prompt: search button cell Expected: around H4 Result: C1

A4B - Prompt: destination cell click Expected: around C4 Result: C4 - Prompt: destination cell center Expected: around C4 Result: C4 - Prompt: search button cell Expected: around H4 Result: H4

Updated conclusion - A4B with reasoning enabled is clearly better than E4B for full-page click-cell selection on this screenshot - E4B is still cheaper and good for OCR / state read - The earlier A4B failure looked more like an output-budget / reasoning interaction than a broken multimodal config

[-]

Positive-Stock6444@reddit

It’s less about knowledge it contains, and more about it knowing what it doesn’t know and then being able to use tools to work effectively.

Models as a database was a cute demo, but hallucination and confidently wrong cures users of the cuteness pretty quickly…

Capability is the new benchmark, and small capable models plus tools are the real local frontier.

[-]

Ornery-Ad2484@reddit

Ufortunatelly gemma 4 cannot write properly in Polish language. Content is acceotable but gramaticaly its a nightmare

[-]

AccordingWarthog@reddit

Is this with the latest llama.cpp including the tokenizer/quantization changes?

[-]

Ornery-Ad2484@reddit

Not sure, i use IT on Spark dgx with default ollama setup

[-]

evia89@reddit

Write in english then translate with another one?

[-]

Ornery-Ad2484@reddit

Useless when you want to use IT on priductuon scale with 15 employees

[-]

ValisCode@reddit

is there any report comparing Deepseek R1 full size with the Gemma 4 family?

[-]

Foreign_Yard_8483@reddit

1) High distillation efficiency and no real deep thinking (not even in an emergency

3) It's like having a cyborg that mimics the 21st-century consumer. It goes to the office, answers calls, responds cordially, shops, and pays bills.

But it won't think for itself about crossing a skyscraper on a cable; nor will it get it into its thick skull that the earth is flat.

[-]

Macestudios32@reddit

The comparison, as others have already commented, does not make sense.

It is not the same with a tourist guide of a city as with a historian versed in that city.

One will be able to handle people better and tell 4 things about the city which is what always counts and the other can talk about the city days and days

Do we want a model with all the knowledge in the world? It's one thing Do we want a model that better understands requests, and extracts knowledge from the internet or elsewhere? It's something else.

The usual, intelligence is not the same as wisdom.

[-]

Zeeplankton@reddit

Not 25x worse, but still worse, despite benchmarks saying otherwise. I feel like there is always a small model "vibe". Logic gaps and assumptions are just larger and more nonsensical. I think parameter count or just raw knowledge is still critical.

[-]

Designer_Reaction551@reddit

The compute trajectory is genuinely wild. R1 needed enterprise-grade hardware, Gemma 4 runs on a decent consumer GPU. The capability-per-parameter improvements are compounding faster than most predicted. Distillation techniques are doing more work than people realize.

[-]

blablarthur@reddit

the thing thats funny is that the price per M tokens are about the same even though it's 25x times smaller 😅 (at least on openrouter)

[-]

Rich_Artist_8327@reddit

It just means larger and older models were unefficient.

[-]

meca23@reddit

When R1 came out, it was lauded for it's efficiency. At the time It was the cheapest SOTA model for inferencing at around 1/10th the cost of the competition.

[-]

Rich_Artist_8327@reddit

the cost has nothing to do about the model. The cost is decided by the company, some times behind a company can be a state who says lets sell this model cheaper so that the other modes becomes unrelevant.

[-]

LegacyRemaster@reddit

And a year later, absolute silence from them...

[-]

GreenGreasyGreasels@reddit

I think this can be overstated. While current small models are many times better than old models from a year or two ago, large models with similar arch and comparable scaled training data and recipe are a whole ball game all together.

A 200B Deepseek v4 lite would be many many more times capable than a 32B model - despite benches saying 78 vs 81 in this or that metric. This is a limitation of the benches and what they can capture, not a true comparison of their relative capabilites.

If all you are doing is creating flappy bird or shot a landing page the difference is moot, but for anything that requires some depth of expertise, nuance and sustained work the larger models dominate much more than their relative size might imply.

I am grateful for the wonderful local models I can run but I have no illusion how Opus 4.6 or even the venerable Deepseek V3.2 completely outclass smaller local models.

[-]

Eyelbee@reddit

The whole point of the post was to point out 30B tier models are better than 15x models of a year ago, of course you'd expect today's larger models to surpass those.

[-]

matt-k-wong@reddit

agree and disagree at the same time. If you were stuck using only one model you would most certainly be better off using a frontier model for everything. However, the way I see it is that small models can do approximately 90% of what frontier models can do and they do it faster too. If you have the luxury of only using frontier models all the time this is clearly superior but there is a threshold where at a certain complexity there us a small difference in output quality between small models and large models. If you knew the complexity of each and every task you could in theory delegate the task to the approropate model and reap the speed benefits as well.

[-]

JohnMason6504@reddit

The compression ratio is even crazier when you factor in quantization. DeepSeek R1 at FP16 needed multiple A100s. Gemma 4 at Q4 fits in 16GB VRAM and arguably matches or exceeds R1 on most reasoning benchmarks. We went from needing a data center rack to a single consumer GPU in 15 months. The MoE architecture improvements from Google are doing a lot of heavy lifting here.

[-]

amethyst_mine@reddit

this has to be the worst comparison i have ever seen

[-]

matt-k-wong@reddit

I've noted the trend of intelligence density per parameter count becoming increasingly compressed. To me, it feels like we've reached an inflection point where the average laptop now has access to decent LLMs (defined by 32GB or so). Further, I expect this trend to continue. I would not be surprised if future \~70B models demonstrate the agentic grit of the current 120B models, though one would hope they achieve similar results in the 30B class.

[-]

wahnsinnwanscene@reddit

There's a bifurcation though, where for a task T, getting multiple agents to work on a task might be better than a single monolith. There's probably an upper limit to this understanding density.

[-]

matt-k-wong@reddit

I've played with this a lot and I confirm that a group of agents each given the information they need and proper tasking (prompts) is better than a single large monolith.

[-]

itsmebenji69@reddit

It always is. The less “cognitive load” you give the model, the more it can “focus” on the specific task.

It’s very important to limit the scope and context of each task. Especially with those small models where the model may be able to perfectly do each single task, but fail at doing them all at once

[-]

netherreddit@reddit

For comment C, I have made observation O, that leads me to conclusion C.

The equation that uses these variables? Uh, I forgot about that. But call it E.

[-]

DraconPern@reddit

*cries in amd laptop*

[-]

blbd@reddit

They need to put Medusa Halo in more laptops. It's a bit too late for Strix. Strix didn't make it into any laptops with 4K or higher res screens.

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[-]

Mister_bruhmoment@reddit

I think the launching pad for small models are the tools. I can't stress how gimmicky all the local models felt to me when playing with them in LM studio. A week or so ago though, I saw someone say that web search made their Qwen be exponentially more useful. I tried it and it really did become much more of an assistant in that moment. In the past days I have just been thinking of tools to add to its arsenal, so that is basically just becomes a brain that decided which tools are right for the job. Its been pretty awesome to say the least. Would be even more awesome if I had the ability to run models above 9B at high context but thats not relevant.

[-]

bzrkkk@reddit

Value is a composition of multiple things:

1) Model parameters (training) 2) Data parameters (training) 3) Context parameters (runtime) 4) Sample parameters (runtime) 5) Environment (training+runtime)

[-]

Altruistic_Heat_9531@reddit

Disclaimer : I have connection to few labs that train an AI model.

In simple terms, it is because the paradigm in data set being use is different, back then the goal is "World Model" basically cram entire internet or world knowledge into a model, while today the goal is "Trajectory Model" where the model is trained to mainly think and process a given input, the world knowledge is came the from tool usage such as web mcp or RAG.

[-]

max123246@reddit

Yeah, there were a couple studies a while back where they realized that training an LLM on everything mean you train it on the poorly thought out inputs too, which turns out to be most stuff on the internet. Just limiting it to research/experts of different fields leads to a better model.

[-]

createthiscom@reddit

2025 was absolutely insane. They went from not even being able to do basic addition to fully grasping advanced math, overnight. Now, GPT 5.4 seems to grasp subtle nuances and knows when to say “I don’t know, let me look it up.”

It feels like DeepSeek is several years behind now, even though it’s probably only 6 months of OpenAI’s calendar time.

I only played with Gemma 4’s image comprehension capabilities today, but it does indeed seem like a very high quality model. I think we’re only going to see more small specialized models in the future as robotics demand accelerates.

[-]

soporificx@reddit

They could do advanced math (proofs) before they could do basic math (arithmetic). Those skills really aren’t related and it’s pretty common for humans to also not really have a correlation between arithmetic skill and logic skill. The basic math skill improved for LLMs because they started sending those questions to a calculator like python instead of having the LLM spit out a number since they realized people expected it to have that skill.

[-]

exaknight21@reddit

I recently tried PrismML’s 1 bit 8B at 64K context and was literally blown away at the knowledge and coherency. Zero hallucinations in my tests and it felt like I was actually speaking to a Qwen3:4B model.

The future is seriously bright. PrismML isn’t getting the love they deserve. The scalability is at 1 bit, not FP16 or FP8.

Time will tell friends. I’m excited.

[-]

PhotographerUSA@reddit

Now it runs slow and inaccurate .