Gemma4 8B model shows up on ollama as gemma4:latest?
Posted by k_means_clusterfuck@reddit | LocalLLaMA | 31 comments
https://ollama.com/library/gemma4:latest
Is this a new model or just an error?
Hammer-Evader-5624@reddit
because ollama is stupid
theplayerofthedark@reddit
This is probably just the E4B model, which is *actually* 8B but, due to its architecture, performs similarly to a 4B in terms of compute requirements. E2B and E4B are kinda weird in that way, as they have significantly bigger embeddings than usual.
_mayuk@reddit
But that's the magic of those two small LLMs c:
Far-Low-4705@reddit
yeah idk tho, even the small models get beat by qwen 3.5 4b and 2b (let alone qwen 3.5 9b)
_mayuk@reddit
I'm planning to modify it and apply the Hope nested-learning embedding architecture to add some phonetic embeddings in the middle layers and a gematric embedding in the low layers (using complex/imaginary numbers for those vectors), to make the AI "see" more than just semantic correlations…
So for me it's perfect; I'll see if I can reuse the already-working embeddings with it hehe…
theplayerofthedark@reddit
Yeah, they're pretty good, and if audio input ends up working in llama.cpp they'll be even better.
Rude_Marzipan6107@reddit
Keep in mind audio input is limited to 30 seconds with those models
AppealThink1733@reddit
The question is whether this functionality will arrive quickly in llama.cpp.
Fun_Librarian_7699@reddit
Yeah, E2B is 5.1B in total and E4B is 8B. They chose these names because they only use 2B and 4B in RAM.
Ok_Mine189@reddit
It uses the same compute as 2B or 4B, but it uses more RAM.
x0wl@reddit
Actually, it loads the full weights into RAM. You can split the model, loading the PLEs into system RAM and the rest into VRAM, and get good performance, but you still have to load all of it.
I think you can maybe mmap the PLEs and use them from disk, but it would be much slower.
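A rough sketch of the mmap idea, assuming the PLEs are a flat per-token lookup table on disk; the file name and shapes here are invented, and `numpy.memmap` stands in for whatever the real runtime would use. The OS then pages in only the rows that are actually indexed:

```python
import numpy as np

# Hypothetical on-disk PLE table: one row per vocab entry.
# Sizes are toy values for illustration, not Gemma's real dims.
vocab_size, ple_dim = 1000, 8

# Create a demo file so this sketch is self-contained.
table = np.memmap("ple_table.bin", dtype=np.float32, mode="w+",
                  shape=(vocab_size, ple_dim))
table[:] = 0.0
table[42] = 1.0   # pretend this is token 42's per-layer embedding
table.flush()

# Re-open read-only: indexing one row only touches that row's pages,
# so the bulk of the table can stay on SSD.
ro = np.memmap("ple_table.bin", dtype=np.float32, mode="r",
               shape=(vocab_size, ple_dim))
row = ro[42]
print(row[:3])
```

The point of the sketch is the access pattern: one small row per token, which is what makes disk-backed storage plausible here at all.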
z_latent@reddit
Oh, not slower at all! A good implementation should have minimal impact from SSD off-loading.
The PLEs are insanely small compared to loading MoE experts: we're talking 14 kB per token (E4B PLE) vs 1.4 GB (26B-A4B MoE). That's 100,000x less!
Now you could argue, "but SSDs can be so much slower it still bottlenecks?" Well, SSDs are not that slow.
Since we're reading 14 kB at a time, we won't saturate the SSD's read speed, so latency will dominate. But even if you assume a generously high latency of ~100 μs, the SSD certainly wouldn't be the bottleneck until you hit 1,000+ tok/s. In a perfect world with no extra SSD overhead you could reach a theoretical max of 10,000 tok/s before it bottlenecks (but you'd certainly bottleneck on RAM or GPU first anyway).
In short: SSDs are kinda awesome and we should appreciate them more <3
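The arithmetic above is easy to sanity-check; these are the commenter's own (approximate) numbers, not measured values:

```python
# Back-of-envelope check of the SSD off-loading argument,
# using the assumed numbers from the comment above.

ple_bytes_per_token = 14e3     # ~14 kB PLE read per token (E4B, assumed)
moe_bytes_per_token = 1.4e9    # ~1.4 GB per token for a 26B-A4B MoE (assumed)
ssd_latency_s = 100e-6         # "generously high" random-read latency

# How much more data a MoE streams per token than a PLE lookup:
ratio = moe_bytes_per_token / ple_bytes_per_token

# One ~14 kB read per token, so latency caps throughput at:
max_tok_per_s = 1 / ssd_latency_s

print(f"MoE moves ~{ratio:,.0f}x more data per token")        # ~100,000x
print(f"latency-bound ceiling: ~{max_tok_per_s:,.0f} tok/s")  # ~10,000 tok/s
```

Both numbers match the claims in the comment: a ~100,000x smaller per-token read, and a latency-bound ceiling around 10,000 tok/s.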
x0wl@reddit
I won't argue lol, I just haven't seen an implementation of that yet. I'd be very happy if it works like this
z_latent@reddit
Aw, I was hoping to rant about it for an hour longer...
Jokes aside, I hope to see that as well. We just need someone to implement it in llama.cpp or something.
z_latent@reddit
And because it only requires the computation of a 2B or 4B model!
It's like MoE, but selecting "experts" via the token ID rather than dynamically through a router. That means you have far more "experts" (100,000s instead of 100s), but much smaller ones, and only one is loaded per processed token. That's why it uses so little bandwidth, and why it's viable to store them on SSD, unlike MoE.
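A toy sketch of that idea: each layer keeps a per-token lookup table (the "experts" indexed by token ID rather than a router), and processing one token touches exactly one small row per layer. All shapes here are invented for illustration and are nothing like Gemma's real dimensions:

```python
import numpy as np

# Toy per-layer embedding (PLE) tables: one "expert" row per vocab entry.
vocab_size, ple_dim, n_layers = 1000, 8, 4
rng = np.random.default_rng(0)

# In practice these could live on SSD (e.g. memory-mapped), since
# only one row per table is ever read for a given token.
ple_tables = [rng.standard_normal((vocab_size, ple_dim)).astype(np.float32)
              for _ in range(n_layers)]

def ple_rows_for_token(token_id: int) -> list[np.ndarray]:
    """Fetch the single PLE row each layer needs for this token.

    Selection is by token ID, not by a learned router, which is
    the key difference from MoE expert selection.
    """
    return [table[token_id] for table in ple_tables]

rows = ple_rows_for_token(42)
bytes_touched = sum(r.nbytes for r in rows)
print(f"{len(rows)} rows, {bytes_touched} bytes touched for one token")
```

Even in this toy version, a token touches only `n_layers * ple_dim` floats out of the whole table, which is the bandwidth argument in miniature.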
FrogsJumpFromPussy@reddit
And because it only requires the computation of a 2B or 4B model!
That's not what I experience on my iPad. The 8B model won't even load. No issue with Qwen 9B (same quant for both).
z_latent@reddit
Hmm? Whatever you're using might still have a w.i.p. Gemma 4 implementation, since it is still recent. For example, it seems llama.cpp's implementation had issues up until today.
In principle, I see no reason why the model should fail to load when Qwen 3.5 9B loads fine. I at least know that someone managed to get the model running on their iPhone, using the Google Edge Gallery app. Maybe give that a try?
FrogsJumpFromPussy@reddit
PocketPal finally supports Gemma 4 on my iPad (M1, 8 GB RAM). I have no issue running Qwen 4B Q6_K on it, but Gemma E4B won't even load at small quants. Qwen 9B IQ3 is fine too. Only E2B works.
Specter_Origin@reddit
I kind of gave up on ollama over a year ago thanks to these naming shenanigans.
Far-Low-4705@reddit
if you view the model list, there are actually full names, these are just shorthands for those that are not as technical. for example: `gemma4:e4b-it-q4_K_M` or `gemma4:e4b-it-q8_0`
Herr_Drosselmeyer@reddit
Yeah, I had to use it for a while, and it annoyed me too. I also disliked how they made it needlessly complicated to load a model I'd downloaded myself. I've been using KoboldCPP for a while now, and it doesn't have that issue: just point it at the model file, choose your settings, and load it. It shouldn't be any more complicated than that.
Minute_Attempt3063@reddit
odd naming
yuicebox@reddit
Cannot recommend enough switching away from ollama and just using llama.cpp directly.
ollama is essentially a monetized fork of llama.cpp that adds unnecessary abstraction layers and constraints.
Sure, it may make downloading a model easy, but it names that model with an incomprehensible hash and stores it in some random folder.
llama.cpp respects your intelligence, so you can store your models anywhere, name your .gguf files coherently, and use any model/quant you want without creating modelfiles.
I used to recommend llama-swap, which is still great, but more recent versions of llama.cpp server now offer every feature I really want. I run it in docker and have a config.ini which controls model-specific settings.
lemon07r@reddit
Ollama's bad anyway; you're better off using something else, and ironically something like lcpp or kcpp is both simpler and easier to use.
Powerful_Evening5495@reddit
No, it is correct.
https://huggingface.co/google/gemma-4-E4B-it
8b model
robberviet@reddit
Ollama? Haha no. They messed up (on purpose) the naming game long ago.
ghulamalchik@reddit
I think it's correct. If you've noticed, Gemma 4 E4B is noticeably larger than typical 4B models, double the size in fact. That's because the "E" in E4B refers to "effective" parameters, not total. Total is probably 8B. Kinda like MoE.
z_latent@reddit
Yep. It's like MoE, but selecting "experts" via the token ID rather than dynamically through a router. That means you have far more "experts" (100,000s instead of 100s), but much smaller ones, and only one is loaded per processed token. That's why it uses so little bandwidth (and compute), and why it's viable to store them on SSD, unlike MoE.
(copied from another reply I made)
Mashic@reddit
It's something like a 4.5B-parameter model with 3.5B of that in embeddings.
tvall_@reddit
8b would be the e4b iirc
sebaxzero@reddit
google/gemma-4-E4B-it