Gemma 4 26B A4B is still fully capable at 245283/262144 (94%) context!
Posted by cviperr33@reddit | LocalLLaMA | View on Reddit | 84 comments

It solved an issue with a script that pulls realtime data from nvidia-smi; Gemini 3.1 failed to fix it at a fresh session start lol.
Kinda mind blowing how in 2026 we can already have stable 200k+ context local models. I tested it by pasting in as many reddit posts, random documentation pages, and raw files from the llama.cpp repo as I could, to push the context as far as possible and see how it affects my VRAM. But during this testing gemma still had its mind intact!
At 245283/262144 (94%) context, if I ask it to tell me what a given user said, it matches it perfectly and tells me within 2-5 seconds.
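For anyone wanting to reproduce this kind of context-stuffing test, the fill arithmetic is easy to sketch; a minimal example (the ~4 chars/token ratio is only a rough heuristic, real counts depend on the model's tokenizer):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for pasted text; actual counts are tokenizer-specific."""
    return int(len(text) / chars_per_token)

def context_fill(used_tokens: int, context_size: int = 262144) -> str:
    """Format context usage the way the post reports it, e.g. 245283/262144 (94%)."""
    pct = round(100 * used_tokens / context_size)
    return f"{used_tokens}/{context_size} ({pct}%)"

print(context_fill(245283))  # -> 245283/262144 (94%)
```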

From previous tests I did, I had to decrease the temp and bump the repeat penalty to 1.18 so it doesn't fall into a loop of self-questioning. Above 100k it started to loop on its own thoughts and argue with itself, and instead of deciding to print one final answer it just kinda goes forever, so these settings helped a lot!
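For reference, these sampler settings can also be overridden per request against llama-server's OpenAI-compatible endpoint, so you can experiment without restarting the server; a sketch of the request body (min_p and repeat_penalty are llama.cpp server extensions to the standard schema, so double-check them against your build):

```python
import json

def build_request(prompt: str) -> dict:
    """Chat request body with the anti-looping sampler values from the post."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.95,
        "top_k": 40,              # llama-server extension field
        "min_p": 0.05,            # llama-server extension field
        "repeat_penalty": 1.18,   # bumped to stop self-questioning loops past ~100k
        "max_tokens": 2048,
    }

body = json.dumps(build_request("Summarize this thread."))
```

POSTing this to the /v1/chat/completions endpoint of the server launched below would apply these values for that request only.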
Using the latest llama.cpp, which gets new updates like every hour, and the latest unsloth GGUF that got updated 2-6 hours ago, so redownload!
Model: gemma-4-26B-A4B-it-UD-IQ4_NL.gguf (unsloth)
These are my current settings for llama.cpp, which I start with a PowerShell script:
# --- [1. MODEL PATHS] --- (placeholder paths, adjust to your setup)
$ModelPath = "C:\models\gemma-4-26B-A4B-it-UD-IQ4_NL.gguf"
$MMProjPath = "C:\models\mmproj-gemma-4-26B-A4B.gguf"
# --- [2. OPTIMIZATION PARAMETERS] ---
$ContextSize = "262144"
$GpuLayers = "99"
$Temperature = "0.7"
$TopP = "0.95"
$TopK = "40"
$MinP = "0.05"
$RepeatPenalty = "1.17"
# --- [3. THE ARGUMENT CONSTRUCTION] ---
$ArgumentList = @(
"-m", $ModelPath,
"--mmproj", $MMProjPath,
"-ngl", $GpuLayers,
"-c", $ContextSize,
"-fa", "1",
"--cache-ram", "2048",
"-ctxcp", "2",
"-ctk", "q8_0", # quantize K cache to q8_0
"-ctv", "q8_0", # quantize V cache to q8_0
"-b", "512", # Smaller batch for less activation overhead
"-ub", "512",
"--temp", $Temperature,
"--top-p", $TopP,
"--top-k", $TopK,
"--min-p", $MinP,
"--repeat-penalty", $RepeatPenalty,
"--host", "0.0.0.0",
"--port", "8080",
"--jinja",
"--metrics"
)
# --- [4. LAUNCH] --- (assumes llama-server is on PATH; splatting passes the array as args)
& llama-server @ArgumentList
What else can I test? Honestly I ran out of ideas to crash it! It just gulps and gulps whatever I throw at it.
PassengerPigeon343@reddit
I know the 31B version is technically stronger, but this 26B is becoming my favorite because it is ridiculously fast and I am genuinely impressed with it. I still need to try today’s updates and do some tweaking, but it is incredible so far.
LeRobber@reddit
26B is so so so fast, it's truly addictively so.
IrisColt@reddit
You nailed it!
IrisColt@reddit
It also seems that it's slightly better than its Gemma4 31B counterpart at creative writing, according to EQBench. (I am currently verifying this.)
cviperr33@reddit (OP)
Same findings, once you try the speed you can't go back to dense models 😄
Nightishaman@reddit
I get 700 tokens/s with my Mac
PassengerPigeon343@reddit
Damnit, this comment may very well cost me thousands of dollars. And I thought mine was fast at 110-120… Now I want one of the to-be-released M5 Mac Studios…
Nightishaman@reddit
I'm actually thinking about getting the M5 Ultra Mac Studio with 128 GB of RAM once it comes out, because privacy is very important for my software development
cviperr33@reddit (OP)
700 tokens/s on a mac? what :O that's the 26b a4b model? what mac is this, and what do you use as an inference engine?
Final-Frosting7742@reddit
I think what they're not saying is that it's on prefill. Still a good performance though.
Nightishaman@reddit
I don’t know. It was just a bar in oMLX before the tokens arrived in Claude Code. It was still very fast for it to be enjoyable.
cviperr33@reddit (OP)
yeah I couldn't believe it, so I checked Macs, and the newest ones are actually very close to a 3090 in token gen; in prefill, yeah, 700 while a 3090 runs at 3700-3900
Nightishaman@reddit
It's the 26b-a4b model quantized to 4 bits on the MLX engine. I use oMLX as a frontend and caching layer. I set caching to 46 gb max and hot cache to 8 gb on my Mac Studio M4 Max 36 GB
IrisColt@reddit
Try with "avoidance prompt clauses", heh... In my benchmarks there are a lot of them, such as:
Do not foreshadow. Avoid fixed response cadences or checklists. Invite user input without repetitive phrasing. Do not this or that... Aim at their current pet peeves. Gemma 4 is better than Qwen3.5 but still not perfect.
cviperr33@reddit (OP)
OMG LOL! 😂 the OG IrisColt, sorry if I used your name without permission in this post, I was honestly just testing the model's performance and copy pasting random reddit threads, and it just so happens that when I was picking a random name I picked you :D
Also thanks for the suggestions! I hadn't thought about them. Also yes, I noticed that gemma 4 is really good at copying human emotions, or acting more human; qwen was more like a machine. Gemma is too if you prompt it that way, but sometimes when I read its CoT, it surprises me how far we have reached into AI's potential.
IrisColt@reddit
:wink: That said, I didn't think I'd be genuinely impressed by a model so soon after Qwen 3.5 27B became part of my workflow, but here we are completely blown away by Gemma 4's strengths.
IrisColt@reddit
Try ambivalent prompt sequences... Gemma 4 is less prone to having second thoughts about a topic, but in such cases it admits that it's hard to choose, and roughly 50/50 it chooses one or the other, humanely justifying its choice. Qwen 3.5 is more of a bot without such moral qualms, and steers itself towards its preferred answer quickly, no strings attached (I hate that, and I perceive that behavior as generally less emotionally intelligent).
IrisColt@reddit
Thanks!!!
Sadman782@reddit
Same experience, I use IQ4 from unsloth and can't believe how good it is. It's very underrated, and many have a bias that it's worse, due to the many llama.cpp issues that are actively being fixed, people using bad old chat templates for agentic coding, or using ollama, which is slow to update (same for the early broken LM Studio support, etc.). This unsloth quant is gold, very close to the official AI Studio release in my experience.
Fair_Ad845@reddit
agreed, the A4B variant is underrated. for my use case (long document QA) the 94% context retention is more useful than raw benchmark scores. a model that keeps coherent at 200K tokens beats one that scores 2% higher on MMLU but falls apart at 32K.
IrisColt@reddit
Gemma4 31B doesn't seem to fall apart at 32K context, but a 24GB VRAM user cannot exceed 64K without significant speed penalties.
Specter_Origin@reddit
I couldn't agree more! It has been really good for me, and I have never been able to get it stuck in a loop (whereas I had a really bad experience with that with Qwen A3B, and also extreme overthinking).
LM Studio is great in terms of convenience, but it's just not cut out for trying new models.
cviperr33@reddit (OP)
Hmm interesting, such a low top-k value I have never seen anyone use before; if I get stuck on an issue I will try your settings, thanks!
ElKorTorro@reddit
When I'm in LM Studio and search for "Gemma 4", I see a long list of Gemma-4 models that seem to be different versions/modifications of it? What's the difference in all these permutations? E.g.
Gemma-4-26B-A4B-JANG_4M-CRACK
gemma-4-26B-A4B-it-GGUF
Why are some models like 900MB and others 15GB?
cviperr33@reddit (OP)
Just download the ones by unsloth: type gemma 4 26b a4b into the search, and in the drop down menu select a model that can fit into your VRAM (note: file size = VRAM needed, pretty much).
Gemma-4-26B-A4B-JANG_4M-CRACK is the JANG 4M crack GGUF; he does uncensored or abliterated quants, if you want that kind of thing.
Far-Low-4705@reddit
I'll be honest, I was quite surprised with qwen 3.5 too. I know people say it "falls apart after 64k", but it was still absolutely usable at all contexts I tested.
This is also true for Gemma, although I think llama.cpp still has some bugs with it.
I think no model is as effective at crazy context lengths, but they are definitely still usable; you just probably shouldn't be asking one to one-shot the next Claude Opus at that context.
RedditSylus@reddit
Is this model any good at coding HTML, CSS, JavaScript, or Swift(UI) for front end development, or for making native iPhone or Mac apps?
balder1993@reddit
One thing about it is that it does know Swift 6 and the Swift Concurrency concepts like actors. It’s probably not the most up to date though.
cviperr33@reddit (OP)
This is the UI it built in one shot, 1 prompt and I didn't click or do anything. It can definitely do beautiful UI, as long as you are specific; this one took about 2-4 min from start to finish on my GPU.
cviperr33@reddit (OP)
prompt was :
**"Build a full-stack 'Rent vs. Buy' Investment Simulator using React and Tailwind CSS. The app should have a split-screen view: an input sidebar on the left and a results dashboard on the right.
Logic Requirements:
RedditSylus@reddit
This is nice. Yes, finding the right prompt is everything. There is a guy on here, a prompt master, who has awesome free prompts and made an agent, I think, where you put your prompt in and it spits out an updated professional prompt. Pretty awesome stuff.
cviperr33@reddit (OP)
Yeah absolutely. And that's the beauty of current AI models: since they are trained on natural language, the best way to communicate with them is with words lol.
If you noticed, all of these open source tools, and also the leaked claude code repo, share the same principle: they are literally just fancy prompt instructions, just words :D
So the prompt matters a lot! The system prompt drastically changes how the model performs; you could have the #1 gemma 4 model in the world because your system prompt is way different from the rest of the world's!
For example, if I don't specify in the system prompt that the current year is 2026, the model will refuse to make a search call to the tool, because it still thinks it's 2024, and therefore, following its logic, 2026 doesn't exist and it's more efficient to not tool call! Which is absolutely true.
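A minimal sketch of that date workaround, injecting today's date into the system prompt so the model doesn't assume its training cutoff year (the message schema is the standard OpenAI-style chat format; the prompt wording here is just an illustrative assumption):

```python
import datetime

def build_messages(user_prompt: str) -> list[dict]:
    """Prepend a system message stating the current date before the user turn."""
    today = datetime.date.today().isoformat()
    system = (
        f"The current date is {today}. "
        "You have access to a web search tool. Events after your training "
        "cutoff may exist, so do not refuse tool calls on the assumption "
        "that the current year is in the past."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]
```

The messages list can then be dropped straight into a chat-completions request body.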
RedditSylus@reddit
I built a system and front end for LM Studio using 3 LLMs, one 14B and two 9B, that can make apps. You type in what you want and it creates the app in Swift and loads it into Xcode, and all you do is hit run and done. Curious if this model can do something like this itself. Any input is appreciated.
cviperr33@reddit (OP)
how much RAM or VRAM do you have? and yes, I believe it can do all of this, you just have to try it haha.
This is what makes open source models exciting and awful at the same time: you don't know until you try, and they require a lot of tuning and trying. It's not as simple as pay $200, copy this API key into your tool, and go ahead.
RedditSylus@reddit
I am using a Mac M3 Max with 36gb memory, and yes I kick myself for not getting the 128gb, but who knew at the time. I never got into AI until about a few months ago, after building the front end. I am addicted now.
cviperr33@reddit (OP)
Tell me about it haha, I'm angry at myself that I didn't get more memory last year when it was cheap, and now I'm stuck with just 32GB VRAM; when I launch these models I have only like 7 GB of RAM free... Enough to do any work I want, but still, it limits me a lot. I can't just switch from opencode to my openwebui and chat there, because the context has to be transferred into RAM, and then it grows to 31.9GB used and it's a mess :X
If you plan on using a local model as a tool calling agent (opencode, codex, openclaw, claude code etc...) you need a fast model. So even if you have all this RAM, you can't just load the biggest one, because it will be slow as hell; 26b a4b is almost as smart as them and works really fast, 90-100 tk/s on my rtx 3090.
I started like 1-2 weeks ago; before that I never considered local models viable, I only used claude. But after trying qwen3.5 I was mind blown by the progress, it is literally claude 4.6 level (well not really, but close) in a local LLM! crazy. And now I've been playing around with gemma 4 for 10 days and am more amazed by it every day.
RedditSylus@reddit
Actually using, now that I think of it, maybe a 14B, a 9B, and a 5-6B, using 20-25gb memory total. Had to downsize the models, as when I hit build it crashed my machine with the larger ones. Too much memory got eaten up and boom. 💥
Material_Policy6327@reddit
Does anyone else run into this model cycling thinking over and over again?
cviperr33@reddit (OP)
yeah, me, and I explained how I fixed it in the post ☝️
andy2na@reddit
Have you tested it against qwen3.5-35B? How does it compare in coding, and all other tasks?
Also, you should try Crush, I like it a bit better than opencode
cviperr33@reddit (OP)
OMG I had to come back and thank you for the "Crush" recommendation! It's insane! Sooo much better :D I love it! It explained to me what LSPs are and auto-configured them for me, plus installed a screenshot MCP, so it can pretty much do anything on its own, and it never stops, unlike opencode or claude code.
andy2na@reddit
Yeah it's extremely helpful, and VERY pretty to watch it run 😂
cviperr33@reddit (OP)
I started with qwen3.5 35b, then I switched to the dense 27b, and then gemma came out and I've been with gemma since.
Honestly I cannot tell the difference between the two; both are excellent at coding, and both can keep opencode going until they finish the whole project.
But gemma 4 26b a4b just ran like twice as fast, even faster than that lol, so I stuck with it, since I got addicted to the speed and cannot go back.
andy2na@reddit
interesting, I get avg 100t/s with qwen3.5-35B and 80t/s with gemma4-26B. It's likely because qwen is 3B active and gemma is 4B active
cviperr33@reddit (OP)
hmm, maybe because I'm on windows and llama.cpp, which has a lot of issues with qwen and their caching.
And yes, I forgot to tell you that there was no way I could load the full 260k context in qwen; that was the primary reason I moved too. At 80-100k it was painfully slow, but gemma was lightning fast no matter the context size, and it can load at 260k, while qwen caps at 120-150k on 24gb vram
Cool-Chemical-5629@reddit
Gemma 4 MoE on my regular home PC can do things I used to admire about Claude 3.7 Sonnet. It doesn't have as much knowledge overall, but it's like a little Gemini for small hardware at home for emergencies when you lose the internet connection etc.
Kodix@reddit
Gemini CLI has *awful* availability/wait times, in my experience. And also uses flash a whole lot.
I legitimately prefer a solid and constantly available local gemma install, so far at least.
iansaul@reddit
Switch your API endpoint to EU, or whatever timezone is less active.
I've kept that one in a back pocket for months, figured it's ok to let the cat out of the bag a bit.
the__storm@reddit
Man, no matter what I do I cannot get the tool calling to work; on latest llama.cpp and redownloaded Unsloth IQ4_XS (both from this morning), and I've tried the llama.cpp jinja template workaround as well. Like 95% of the time it completely fails to call the tool and gets stuck in a loop of "Wait, I'll just do it." 5% of the time it successfully calls the tool but with bad parameters, like it'll insert a bunch of code instead of editing it.
Qwen 3.5 35B works perfectly fine with exact same setup (in fact it thinks like 1/20th as much, which is ironic considering the reputation Q3.5 had for overthinking), so I'm kind of at a loss.
anthonyg45157@reddit
Having very good results with these tips
JustSayin_thatuknow@reddit
I've found that using ctk and ctv at "bf16" (rather than q8) it never failed with tool calls again!! And the speed is only very slightly slower than with q8, so I recommend you try it too!
90hex@reddit
How much (V)RAM does it take for full context? Gemma 4 left a sour taste in my mouth in that regard.
cviperr33@reddit (OP)
so at 240k I was at 22gb; at full it would probably be 22.5.
31B is way too slow for my liking; even though it's slightly smarter than the MoE model, 26b is 5x as fast, which makes it the perfect model for agentic tools.
Try them now, they are significantly better since yesterday, with the big updates the folks at llama.cpp and google did
jeremyckahn@reddit
Is this for just the context window, or context window + weights (AKA grand total RAM usage)?
cviperr33@reddit (OP)
That's context window + weights of course; the weights are really small because it's IQ4_NL, they are like 14-16gb
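Those sizes line up with simple bits-per-weight arithmetic; a rough sketch (26B parameters and ~4.25 bpw for an IQ4-class quant are assumptions on my part, and real GGUF files add embedding and metadata overhead on top):

```python
def quant_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk/VRAM size of quantized weights, in decimal GB."""
    return n_params * bits_per_weight / 8 / 1e9

# ~26B params at ~4.25 bpw (typical for IQ4-class quants) -> ~13.8 GB,
# in line with the 14-16 GB figure once overhead is included.
print(round(quant_size_gb(26e9, 4.25), 1))  # -> 13.8
```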
jeremyckahn@reddit
Wow that is WAY better than I would have expected. Thanks for clarifying that!
wh33t@reddit
I'm guessing just KV+Context, weights would be an addition on top of that.
90hex@reddit
Great, I was reading about it in the LMStudio release notes. Not sure all of it made it through, but at least the prompt templates are supported.
cviperr33@reddit (OP)
yeah, keep in mind that you have to redownload the GGUFs, they don't auto-update to my knowledge.
Opening-Broccoli9190@reddit
Try out this turboquant https://huggingface.co/LilaRest/gemma-4-31B-it-NVFP4-turbo - it left me with some 700mb free after full setup and under load
z_latent@reddit
Just pointing out, this has nothing to do with Google's TurboQuant. It quantizes attention and MLP weights but not attention KV itself, which is what TQ does.
So this one won't help with long context much, beyond reducing the memory used by parameters themselves.
90hex@reddit
Is that the updated gemma4? Thanks!
Opening-Broccoli9190@reddit
Don't know, but tool calling works
Ifihadanameofme@reddit
It ran on a stupid pixel 8 pro. A painful 1t/s with the Q3 quant I think, but without GPU acceleration or native support. It made me wonder: if someone made a dedicated MoE that can use GPU acceleration on these devices (on the elite chips from qcom), it might not be so bad, and for non-agentic work a lot of people MIGHT just use it... smaller, more efficient MoEs of course.
cviperr33@reddit (OP)
LOL, the fact that it even ran is an achievement :D
Emotional-Look-7200@reddit
I want to use E4B or E2B for query replies on calls, which doesn't require much thinking, so I don't want big models; I also can't really run the big ones locally. I tested it on ollama with an rtx 3050 6gb laptop but it gave me about 18-19 T/s. Is there any way I could increase the speed? It is not enough, and when running either model I still have some VRAM available.
ahbond@reddit
Gemma 4 long-context use case is exactly where KV cache compression matters. Gemma 4 A4B uses multi-query attention (very few KV heads), so the KV cache is only ~6 GB at 262K context with q8_0.
TurboQuant's asymmetric K4/V3 would bring the KV portion from ~6 GB to ~2.7 GB, enough headroom for another ~130K tokens of context on the same GPU. The real win is that you can drop value precision more aggressively than key precision without hurting attention quality, which llama.cpp's symmetric -ctk/-ctv flags don't expose.
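That arithmetic can be sketched as follows; note the architecture numbers (48 layers, 2 KV heads, head dim 128) are illustrative assumptions picked to roughly match the ~6 GB figure, not published Gemma specs, and block-quant formats like q8_0 carry a little extra overhead per block:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx: int, k_bits: float, v_bits: float) -> float:
    """KV cache size: one K and one V vector per layer, KV head, and position."""
    elems = n_layers * n_kv_heads * head_dim * ctx  # element count of K (same for V)
    return elems * (k_bits + v_bits) / 8 / 1e9      # bytes -> decimal GB

CTX = 262144
q8 = kv_cache_gb(48, 2, 128, CTX, 8, 8)     # symmetric ~8-bit K and V
k4v3 = kv_cache_gb(48, 2, 128, CTX, 4, 3)   # asymmetric K4/V3 as described above
print(round(q8, 1), round(k4v3, 1))  # -> 6.4 2.8
```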
Septerium@reddit
This model is fantastic. And it seems it was not benchmaxed at all, since its scores are not that impressive
cviperr33@reddit (OP)
Yeah lol, it doesn't rank as high as qwen3.5, but it performs better imo in real work, and it's faster
Ayuzh@reddit
what all things did you use it for?
cviperr33@reddit (OP)
Well I don't know :D that's what I'm trying to find out now, a good use case for it. I'm just trying a different model and settings every day, until I settle on one that covers everything.
It has so many applications it's practically limitless, but it has a lot of downsides; it's not as straightforward a setup as opening an account and paying.
Think about it: a model almost as smart as the frontier models, that can handle agentic tool calling and work on its own, but with a lot of quirks and downsides. You just have to find a use case for it. You are not paying for any API calls, just electricity, which is so little it's practically free.
Character_Split4906@reddit
Are you able to fit a 245k context window with the model at q4 quant in 22 gb? I read the gemma 4 26B model is seeing issues with tool calling. Did you face that issue?
cviperr33@reddit (OP)
the tool calling issues seem to be gone since today; both google and llama.cpp did some major updates in the last 48h, and I'm running the latest GGUFs by unsloth that were released 6h ago. Currently I'm using the IQ4_NL and it's just perfect for an rtx 3090: full 260k context, no issues, maximum load is 22gb, leaving 2gb headroom for breathing and windows overhead.
Character_Split4906@reddit
That's amazing! Can't wait to try this on my mbp 5 pro. Last I tried gemma 4, I had issues with the context window length growing and the model going into a loop. Thanks for sharing
notdba@reddit
I thought IrisColt is a she?
jacek2023@reddit
Try agentic coding (opencode, codex, claude code). I am happy with the codex but need to test more.
cviperr33@reddit (OP)
So far I've tried Opencode and Claude code. Opencode is waaay faster and more responsive, so I ditched claude code. I have not tried codex yet, but that was the plan after I find my perfect model :D
minceShowercap@reddit
Any tips for using Opencode and choosing a model for a 5070ti (16gb vram)?
jacek2023@reddit
I mean Gemma 26B (I get 90t/s and I use the Q8 GGUF). Tried it with opencode and codex; I use claude code with claude only, but the plan is to try it there too.
anthonyg45157@reddit
Trying these now; I've been having looping with standard settings, even with the jinja template
Tintinlindo@reddit
How did you set it up to reason?
cviperr33@reddit (OP)
it does that by default , atleast in llama.ccp i didnt had to do anything , you can see my launch parameters.
I dont think you can turn off gemma 4 reasoning , it will still output a reason tag even if its empty if u have it disabled
vogelvogelvogelvogel@reddit
very interesting thank you!
Heavy_Boss_1467@reddit
That new release with the latest updates of llama.cpp is looping again like it did on release day.
I'll give your settings a try, thanks.