Gemma4 8B model shows up on ollama as gemma4:latest?
Posted by k_means_clusterfuck@reddit | LocalLLaMA | 31 comments
https://ollama.com/library/gemma4:latest
Is this a new model or just an error?
Hammer-Evader-5624@reddit
because ollama is stupid
theplayerofthedark@reddit
This is probably just the E4B model, which is *actually* 8B but, due to its architecture, performs similarly to a 4B in terms of compute requirements. E2B and E4B are kinda weird in that way, as they have significantly bigger embeddings than usual.
_mayuk@reddit
But that's the magic of those two small LLMs c:
Far-Low-4705@reddit
yeah idk tho, even the small models get beat by qwen 3.5 4b and 2b (let alone qwen 3.5 9b)
_mayuk@reddit
I'm planning to modify it and apply the Hope nested-learning embedding architecture to add some phonetic embeddings in the middle layers and a gematric embedding in the low layers (using complex/imaginary numbers for those vectors), to make the AI "see" more than just semantic correlations…
So for me it's perfect; I'll see if I can reuse the already-working embeddings with it hehe…
theplayerofthedark@reddit
Yeah, they're pretty good, and if audio input ends up working in llama.cpp they'll be even better.
Rude_Marzipan6107@reddit
Keep in mind audio input is limited to 30 seconds with those models
AppealThink1733@reddit
The question is whether this functionality will arrive quickly in llama.cpp.
Fun_Librarian_7699@reddit
Yeah, E2B is 5.1B in total and E4B is 8B. They chose these names because they only use 2B and 4B in RAM.
Ok_Mine189@reddit
It uses the same compute as 2B or 4B, but it uses more RAM.
x0wl@reddit
Actually, it loads the full weights into RAM. You can split the model, loading the PLEs into system RAM and the rest into VRAM, and get good performance, but you still have to load all of it.
I think you can maybe mmap the PLEs and use them from disk, but it would be much slower.
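A rough sketch of the mmap idea, assuming the PLEs are a flat per-token lookup table on disk; the file name and shapes here are invented, and `numpy.memmap` stands in for whatever the real runtime would use. The OS then pages in only the rows that are actually indexed:

```python
import numpy as np

# Hypothetical on-disk PLE table: one row per vocab entry.
# Sizes are toy values for illustration, not Gemma's real dims.
vocab_size, ple_dim = 1000, 8

# Create a demo file so this sketch is self-contained.
table = np.memmap("ple_table.bin", dtype=np.float32, mode="w+",
                  shape=(vocab_size, ple_dim))
table[:] = 0.0
table[42] = 1.0   # pretend this is token 42's per-layer embedding
table.flush()

# Re-open read-only: indexing one row only touches that row's pages,
# so the bulk of the table can stay on SSD.
ro = np.memmap("ple_table.bin", dtype=np.float32, mode="r",
               shape=(vocab_size, ple_dim))
row = ro[42]
print(row[:3])
```

The point of the sketch is the access pattern: one small row per token, which is what makes disk-backed storage plausible here at all.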
z_latent@reddit
Oh, not slower at all! A good implementation should have minimal impact from SSD off-loading.
The PLEs are insanely small compared to loading MoE experts: we're talking 14 kB per token (E4B PLE) vs 1.4 GB (26B-A4B MoE). That's 100,000x less!
Now you could argue, "but SSDs can be so much slower it still bottlenecks?" Well, SSDs are not that slow.
Since we're reading 14 kB at a time, we won't saturate the SSD's read speed, so latency will dominate. But even if you assume a generously high latency of ~100 μs, the SSD certainly wouldn't be the bottleneck until you hit 1,000+ tok/s. In a perfect world with no extra SSD overhead you could reach a theoretical max of 10,000 tok/s before it bottlenecks (but you'd certainly bottleneck on RAM or GPU first anyway).
In short: SSDs are kinda awesome and we should appreciate them more <3
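The arithmetic above is easy to sanity-check; these are the commenter's own (approximate) numbers, not measured values:

```python
# Back-of-envelope check of the SSD off-loading argument,
# using the assumed numbers from the comment above.

ple_bytes_per_token = 14e3     # ~14 kB PLE read per token (E4B, assumed)
moe_bytes_per_token = 1.4e9    # ~1.4 GB per token for a 26B-A4B MoE (assumed)
ssd_latency_s = 100e-6         # "generously high" random-read latency

# How much more data a MoE streams per token than a PLE lookup:
ratio = moe_bytes_per_token / ple_bytes_per_token

# One ~14 kB read per token, so latency caps throughput at:
max_tok_per_s = 1 / ssd_latency_s

print(f"MoE moves ~{ratio:,.0f}x more data per token")        # ~100,000x
print(f"latency-bound ceiling: ~{max_tok_per_s:,.0f} tok/s")  # ~10,000 tok/s
```

Both numbers match the claims in the comment: a ~100,000x smaller per-token read, and a latency-bound ceiling around 10,000 tok/s.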
x0wl@reddit
I won't argue lol, I just haven't seen an implementation of that yet. I'd be very happy if it works like this
z_latent@reddit
Aw, I was hoping to rant about it for an hour longer...
Jokes aside, I hope to see that as well. We just need someone to implement it in llama.cpp or something.
z_latent@reddit
And because it only requires the computation of a 2B or 4B model!
It's like MoE, but selecting "experts" via the token ID rather than dynamically through a router. That means you have far more "experts" (100,000s instead of 100s), but much smaller ones, and only one is loaded per processed token. That's why it uses so little bandwidth, and why it's viable to store them on SSD, unlike MoE.
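A toy sketch of that idea: each layer keeps a per-token lookup table (the "experts" indexed by token ID rather than a router), and processing one token touches exactly one small row per layer. All shapes here are invented for illustration and are nothing like Gemma's real dimensions:

```python
import numpy as np

# Toy per-layer embedding (PLE) tables: one "expert" row per vocab entry.
vocab_size, ple_dim, n_layers = 1000, 8, 4
rng = np.random.default_rng(0)

# In practice these could live on SSD (e.g. memory-mapped), since
# only one row per table is ever read for a given token.
ple_tables = [rng.standard_normal((vocab_size, ple_dim)).astype(np.float32)
              for _ in range(n_layers)]

def ple_rows_for_token(token_id: int) -> list[np.ndarray]:
    """Fetch the single PLE row each layer needs for this token.

    Selection is by token ID, not by a learned router, which is
    the key difference from MoE expert selection.
    """
    return [table[token_id] for table in ple_tables]

rows = ple_rows_for_token(42)
bytes_touched = sum(r.nbytes for r in rows)
print(f"{len(rows)} rows, {bytes_touched} bytes touched for one token")
```

Even in this toy version, a token touches only `n_layers * ple_dim` floats out of the whole table, which is the bandwidth argument in miniature.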
FrogsJumpFromPussy@reddit
And because it only requires the computation of a 2B or 4B model!
That's not what I experience on my iPad. The 8B model won't even load. No issue with Qwen 9B (same quant for both).
z_latent@reddit
Hmm? Whatever you're using might still have a w.i.p. Gemma 4 implementation, since it is still recent. For example, it seems llama.cpp's implementation had issues up until today.
In principle, I see no reason why the model should fail to load when Qwen 3.5 9B loads fine. I at least know that someone managed to get the model running on their iPhone, using the Google Edge Gallery app. Maybe give that a try?
FrogsJumpFromPussy@reddit
PocketPal finally supports Gemma 4 on my iPad (M1, 8 GB RAM). I have no issue running Qwen 4B Q6_K on it, but Gemma E4B won't even load at small quants. Qwen 9B IQ3 is fine too. Only E2B works.
Specter_Origin@reddit
I kind of gave up on ollama over a year ago thanks to these naming shenanigans.
Far-Low-4705@reddit
if you view the model list, there are actually full names, these are just shorthands for those that are not as technical. for example: `gemma4:e4b-it-q4_K_M` or `gemma4:e4b-it-q8_0`
Herr_Drosselmeyer@reddit
Yeah, I had to use it for a while, and it annoyed me too. I also disliked how they made it needlessly complicated to load a model I'd downloaded myself. I've been using KoboldCPP for a while now, and it doesn't have that issue: just point it at the model file, choose your settings, and load it. It shouldn't be any more complicated than that.
Minute_Attempt3063@reddit
odd naming
yuicebox@reddit
Cannot recommend enough switching away from ollama and just using llama.cpp directly.
ollama is essentially a monetized fork of llama.cpp that adds unnecessary abstraction layers and constraints.
Sure, it may make downloading a model easy, but it names that model with an incomprehensible hash and stores it in some random folder.
llama.cpp respects your intelligence, so you can store your models anywhere, name your .gguf files coherently, and use any model/quant you want without creating modelfiles.
I used to recommend llama-swap, which is still great, but more recent versions of llama.cpp server now offer every feature I really want. I run it in docker and have a config.ini which controls model-specific settings.
lemon07r@reddit
Ollama's bad anyway; you're better off using something else, and ironically something like lcpp or kcpp is both simpler and easier to use.
Powerful_Evening5495@reddit
No, it is correct.
https://huggingface.co/google/gemma-4-E4B-it
8b model
robberviet@reddit
Ollama? Haha no. They messed up (on purpose) the naming game long ago.
ghulamalchik@reddit
I think it's correct. If you've noticed, Gemma 4 E4B is noticeably larger than typical 4B models, double the size in fact. That's because the "E" in E4B refers to "effective" parameters, not total. Total is probably 8B. Kinda like MoE.
z_latent@reddit
Yep. It's like MoE, but selecting "experts" via the token ID rather than dynamically through a router. That means you have far more "experts" (100,000s instead of 100s), but much smaller ones, and only one is loaded per processed token. That's why it uses so little bandwidth (and compute), and why it's viable to store them on SSD, unlike MoE.
(copied from another reply I made)
Mashic@reddit
It's something like a 4.5B-parameter model with 3.5B of that in embeddings.
tvall_@reddit
8b would be the e4b iirc
sebaxzero@reddit
google/gemma-4-E4B-it