What are your thoughts on Gemma2 27B and 9B?
Posted by Discordpeople@reddit | LocalLLaMA | View on Reddit | 112 comments
Last time, I tried Gemma 1.1 7B and the results are really terrible. Now with Gemma2 9B, not only did I get results that are on par with Llama3 8B, sometimes even better than it. Gemma2 27B performs really well in my experience for its size, it can replace Llama3 70B for a lot of stuff. That's just my opinion, I wanna hear your opinions on the new Google models. Does it perform well on your side?
Links to the models:
Gemma2 9B IT, Gemma2 27B IT, Gemma2 9B Base, Gemma2 27B Base
brown2green@reddit
Gemma-2-27B-it doesn't work correctly for me, and I tried downloading it from the Google repository and converting it myself to GGUF. Comparing it to the 9B model, I'd even say it is broken in some way. Gemma-2-9B-it on the other hand almost appears to be an uncensored model, often surprisingly so (definitely less censored than Llama-3-8B-Instruct).
I haven't tried them in all scenarios yet though, only typical multi-turn RP.
Straight-Habit-919@reddit
"napisałem" bardzo fajną książkę na modelu gemma 2 27b, super sie czyta, polecam
noneabove1182@reddit
to that end: I personally updated transformers to the wheel they provided before doing the conversion since that package is used with LlamaHfVocab, but I wasn't sure if it was needed.. did you do the same or did you use the released transformers?
brown2green@reddit
I just pulled the latest llama.cpp changes, made a clean build and quantized the model as usual. After checking, it's on transformers 4.40, for what it's worth. Anyway, it's unclear why only the 27B version would be so negatively affected.
noneabove1182@reddit
yeah so my only concern is that it's possible that conversion would not be perfect if you aren't on their transformers wheels, but i have no proof of that, only a theory and trying to minimize variables
_sqrkl@reddit
Per this pr fix waiting to be merged, you will need to re-generate ggufs.
noneabove1182@reddit
Not strictly because they will have a default value, but I intend to do it anyways because presumably my imatrix isn't as good as it could be
brown2green@reddit
I tried again after updating to transformers 4.42.1 (the same one suggested in the latest HF blogpost), forcing --outtype bf16 in the initial conversion before quantization, and it gave the same results.
noneabove1182@reddit
Well dam 😂 I appreciate your thorough investigation. I wonder if it's the logit soft cap thing then
brown2green@reddit
I think it might have fixed the issues, or at least I'm not getting obviously incoherent responses anymore.
PavelPivovarov@reddit
Speaking of uncensored: gemma2 was the first model to refuse to provide me Python code using a Cookie header for authentication because "it's not recommended practice"... Haven't tested it with RP yet, but for me it didn't look more uncensored than llama3.
brown2green@reddit
Try with some prefill or a "system" prompt (even if officially the model doesn't support that) rather than zero-shotting your questions. For RP, I've tried a limited set of "outrageous" scenarios and it works, whereas Llama-3-Instruct would refuse harshly.
fallingdowndizzyvr@reddit
Try downloading the bartowski GGUF. The first version of 27B he posted yesterday was wacky. Like really wacky. But the updated version he posted last night works much much better.
mrjackspade@reddit
Did you pull the tokenizer update for it? Apparently it was broken so the GGUFs are borked
brown2green@reddit
Yes, the special tokens were being recognized (checked out via verbose llamacpp-server output).
mikael110@reddit
I've had the exact same experience, and I've seen a number of other people say the same thing. I downloaded the model after the tokenizer fix and yet it's just completely broken in the test I usually do (which is a data extraction task). And when I say broken I mean that it doesn't even manage to generate the right keys in the JSON it outputs, which the 9B model does flawlessly every time.
There are issues raised in various places about this like this llama.cpp issue, and this issue in the official Gemma repo. In the latter issue Google states that they are currently investigating this issue, so hopefully it gets fixed soon.
PopularPrivacyPeople@reddit
I can run the 9b in an acceptable way (2-3 tps) on my 16GB RAM AMD Ryzen 7 with AMD Radeon integrated graphics (unused, it only slows things down). I can get the Q3 XXS quant of the 27b to stagger along at 1 tps max, but the content is poor and confused; not worth it for me.
Gemma2 9b and Kunoichi 2DPO are still the best available for my use-case (story writing)
feelosofee@reddit
fps...
Difficult-Slice-5747@reddit
I am quite late to the party, but running on an RTX 4090, 64GB of system RAM and a 16-core/32-thread AMD CPU, I am getting the perfect combination of accuracy and speedy responses. I run it in the base program Brianna for ease of use and text to speech. It can write short stories about two pages long in less than 30 seconds and then read them to me. It cannot do five digit multiplication accurately. I tried for example 654 times 98127 and although it got really close, it could not finish any five digit multiplication accurately. This will be my new personal AI for the foreseeable future. "I'm running out of hardware resources at any scale higher than about 30b"
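Multiplication like this is trivial to verify outside the model, which makes it a handy sanity check for arithmetic claims (the operands below are taken from the comment above; the helper is just an illustrative sketch):

```python
def check_product(a: int, b: int, llm_answer: int) -> bool:
    """Return True if the model's answer matches exact integer math.

    Python ints are arbitrary-precision, so this is always exact.
    """
    return a * b == llm_answer

# The example from the comment above:
exact = 654 * 98127
print(exact)  # → 64175058
```

Comparing the model's reply against `exact` immediately shows how close "really close" actually is.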
felippelazarbr@reddit
I have been using LLMs to understand and classify abstracts according to selection criteria (to simulate a systematic review), and I particularly found Gemma2 9B way better than Llama 3 and Llama 3.1 (8B versions for both). Has anyone had the same impression?
_sqrkl@reddit
Added gemma-2-9b-it to the creative writing leaderboard: https://eqbench.com/creative_writing.html
It's killing it.
I'm having the same issues as others with the 27b version. Hopefully will be ironed out soon.
cyan2k@reddit
But even with the issues it's pretty clear that gemma 27b is a beast. afaik the fix is already on its way. and I personally can't wait.
For once google actually delivered. what a time to be alive.
ILoveThisPlace@reddit
Jesus, we'll see if my tune changes. Guess I gotta test this thing. I've been highly critical of Google. 27B is around the perfect size for a 24GB VRAM consumer card (I haven't looked, just a guess) but models in this range will become increasingly more important.
raika11182@reddit
Now that there's a version that's had refusals tuned out of it (Big Tiger Gemma) I gave it a shot and I'm pretty impressed. I see pros and cons vs Llama 3 70B in my use, so I don't know that you can really say it's smarter, but it's definitely competitive and it is far, far better at writing prose.
ILoveThisPlace@reddit
Hmm good to know. Is it multi-modal?
raika11182@reddit
No, unfortunately. But neither is Llama 3 70B (yet - we've been promised that later).
AssociateDeep2331@reddit
I'm having a problem with gemma-2-9b-it
I say "write a story about. do it one chapter at a time. write chapter 1".
It writes chapter 1 just fine. Then I say "write chapter 2" and it acts totally confused, it says stuff like "please provide me with chapter 1".
What am I doing wrong? Llama3 handles this type of prompt just fine
_sqrkl@reddit
I didn't actually test multi-turn. What are you using for inference? Which version/quant are you using?
Interpause@reddit
sopho released a llama 3 70b merge called new dawn. supposedly on par with but different from midnight miqu. would it be possible to test? thanks
mikael110@reddit
A fix has just been merged into Transformers, so if you update to the latest version the 27b model should behave properly.
oobabooga4@reddit
I think that 27b is still broken in transformers 4.42.3. 27b in bfloat16 precision performs worse than 9b in my benchmark.
capivaraMaster@reddit
Same here.
_sqrkl@reddit
Yep! Seems to be fixed now. I've added the scores to the eq-bench leaderboard, will run the creative writing benchmark overnight.
toothpastespiders@reddit
Awesome to hear! I've been holding out on trying it until I saw someone with problems confirm a fix actually worked.
_sqrkl@reddit
It's still not fixed in llama.cpp afaik. This branch is where they're working on it if anyone wants to keep an eye on progress.
brahh85@reddit
can you add qwen 2 72b? Maybe I'm seeing ghosts, but testing gemma 2 9b I was feeling the 9b was on par or better for RP.
fervoredweb@reddit
I might misunderstand how the board sample outputs next to the model scores work, but a large number of the models have a weirdly similar opening to the first romance story prompt (bells twinkling as a door opens).
_sqrkl@reddit
Some other oddities:
None of these are included in the prompt (although the latter two at least are a reasonably inferred starting point). I guess they are just points of natural convergence of probable tokens for language models.
AnOnlineHandle@reddit
Wow those examples are impressive.
I started skipping the prompts to see if the writing was more impressive going in blind like you normally would to stories, and the 'Epistolary Apocalyptic Survival' one was actually moving.
Feztopia@reddit
Wow, 9b is between Sonnet 3.5 and GPT-4, and Sonnet is the judge, that's cool.
_sqrkl@reddit
I suspect they used (at least some of) their gemini dataset to train these models. The latest gemini pro is fantastic for creative writing imo.
Saifl@reddit
Gemma with fine tuning might be better? And it'll end up cheaper too per token or locally (hopefully)
Feztopia@reddit
I'm curious how good the base model is; I think this time it's the finetune which does all the heavy lifting.
thereisonlythedance@reddit
It’s the model I was hoping for based on how good Gemini Pro 1.5 is at writing tasks. Feels like a mini version of it.
jollizee@reddit
Whoa, that's crazy for a 9b model. Love your v2 updates as well.
ipechman@reddit
not so much on magi-hard lol
datavisualist@reddit
I can easily say that Gemma2 9b is better at multilingual translations than Llama3 8b.
Tim-Fra@reddit
I prefer mixtral and mistral v3... Gemma2 9b & 27b don't support RAG on my openui / ollama server. Llama3 8b is lower than mistral v3... for me
AlexanderKozhevin@reddit
Is there any good jupyter notebook for gemma2 fine-tuning?
sdujack2012@reddit
I tried Gemma2 27B and 9B today, but they didn't impress me. I use LLMs to generate image prompts for every 2-3 sentences of a story and output a JSON array. Both Gemma2 models had issues: syntax errors in the JSON array or blank image prompts. I switched back to Llama3, which works perfectly.
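The JSON failure mode described above (syntax errors, wrapping prose) is easy to guard against with a tolerant parser. A minimal sketch; the helper name and sample reply are made up for illustration:

```python
import json

def extract_json_array(text: str):
    """Pull the first JSON array out of raw LLM output.

    Models sometimes wrap JSON in prose or code fences, so we slice
    from the first '[' to the last ']' before parsing. Returns None
    when no parseable array is found, so the caller can retry.
    """
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None

reply = 'Sure! Here are the prompts:\n["a misty forest", "a ruined tower"]'
print(extract_json_array(reply))  # → ['a misty forest', 'a ruined tower']
```

Falling back to `None` (and re-prompting) beats crashing on the model's occasional stray comma.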
Motor_Ocelot_1547@reddit
in Korean task.. 9b is better
a_beautiful_rhind@reddit
If something sounds too good to be true, it probably is. Maybe they are good riddlers but the 27b on hugging-chat was nothing to write home about. Beating claude-3 sonnet my ass.
large_diet_pepsi@reddit
With a VC money boost, there can be things that are too good to be true
reissbaker@reddit
I'm pretty sure Google, a public company since 2004, isn't taking VC money in 2024 ;)
large_diet_pepsi@reddit
You're right, Google, as a major public company, doesn't rely on VC money. What I meant was that Google's substantial financial resources can sometimes lead to innovation that seems "too good to be true." With their deep pockets, they're often able to push the boundaries of technology further and faster than many smaller players.
Additionally, teams within FAANG companies have the advantage of being fault-tolerant and well-prepared for innovation due to their resources and experience. However, I also believe that the open-source community will continue to make significant strides and eventually catch up.
SatoshiNotMe@reddit
I tried Gemma2-27b via Ollama on a multi-agent + tool workflow script with Langroid,
https://github.com/langroid/langroid/blob/main/examples/basic/chat-search-assistant.py
and it didn’t do well at all.
With Llama3-70b it works perfectly. I wrote about it here :
https://www.reddit.com/r/LocalLLaMA/s/wLgJ07X02Z
raiffuvar@reddit
Can it run on cpu? Still Haven't figured out how to calculate VRAM etc.
Eliiasv@reddit
For GGUF and similar quants there's not much to figure out:
model-file-size.gguf x 1.2 = vRAM requirements.
This is quite accurate; for most intents and purposes, I do 1.25 to account for large ctx window.
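The rule of thumb above, written out as a quick script (the helper name is hypothetical; 1.2 is the multiplier from the comment, 1.25 for large context windows):

```python
import os

def estimate_vram_gb(gguf_path_or_size, overhead: float = 1.2) -> float:
    """Estimate VRAM needed for a GGUF model: file size x 1.2.

    Accepts either a path to a .gguf file or a raw size in bytes.
    Bump overhead to 1.25 to account for a large context window.
    """
    if isinstance(gguf_path_or_size, str):
        size = os.path.getsize(gguf_path_or_size)
    else:
        size = gguf_path_or_size
    return size / 1024**3 * overhead

# A ~5.4 GiB 9B Q4 file would need roughly 6.5 GiB of VRAM:
print(round(estimate_vram_gb(5.4 * 1024**3), 1))  # → 6.5
```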
EmilPi@reddit
I did my favorite logical reasoning test: first I asked
"Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?" - both models answered correct.
Then I rephrased the riddle. Gemma 9B Q8 quant still answered, Gemma 27B Q6 quant failed.
I used LMStudio for testing.
ASYMT0TIC@reddit
Is it possible that these models simply respond much worse to quantization than other models do due to something unique about their architecture or training?
Dry_Parfait2606@reddit
They are designed to use the full fp16... They are probably trying to saturate fp16 to its fullest
harrro@reddit
4 bit doesn't mean 15 variables in this case because there is a multiplier applied over each 4 bit int that tries to get the precision back up to fp16ish levels at least.
I'm not an expert on quant and don't know the full details but the above is a massively simplified version of what happens with quantization.
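That simplified picture can be made concrete with a toy block-quantization sketch: each block of weights is stored as small integers plus one scale factor that maps them back toward fp16-ish values. This is illustrative only; real GGUF quant formats are considerably more involved:

```python
import numpy as np

def quantize_q4(block: np.ndarray):
    """Toy 4-bit block quantization: int values in [-7, 7] plus one
    fp16 scale per block (a massively simplified illustration)."""
    scale = np.abs(block).max() / 7.0
    q = np.round(block / scale).astype(np.int8)
    return q, np.float16(scale)

def dequantize_q4(q: np.ndarray, scale) -> np.ndarray:
    """Multiply the stored ints by the scale to recover approximate weights."""
    return q.astype(np.float32) * np.float32(scale)

rng = np.random.default_rng(0)
w = rng.standard_normal(32).astype(np.float32)  # one block of 32 weights
q, s = quantize_q4(w)
err = np.abs(w - dequantize_q4(q, s)).max()
print(err)  # small but nonzero: precision traded for memory
```

The maximum error per weight is about half the scale, which is exactly the "precision gets lost" trade-off discussed below.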
Dry_Parfait2606@reddit
But the range gets worse when quantized; you trade precision for memory economics. That's why it's fp16ish. You compress data and precision gets lost.
I know when LLMs began to work well (ChatGPT 3/3.5), and I will point out that anything less than that level will not be good enough for decent performance...
You can have the smartest mouse on this planet, but network size can be a good predictor...
Just take Reddit. It's a network of, say, many people in a community... What if you would only allow every 4th person that wants to say something on a post to actually answer? The quality and capacity of the discussion, truth seeking, and problem solving would be compromised, degraded.
Let's go into deep information theory.
Now: you have a request, you do a prompt... And there is only one true perfect answer that can be generated out of the training data of the LLM. If that LLM were infinite in its size, you would get that answer... But you have energy and memory limitations... The more you have to fit that thing into a smaller box, the more compromised you would be... You put pressure on the system...
But now not only do you have physical limitations for an LLM, you also train it, FIRST, to fit into a small network AND SECONDLY, after training, you take that result, which is a representation of the training data, and you degrade it...
I can with confidence, experimentally, convey that LLMs began to work at the ChatGPT 3/3.5 level... Before that they were not reliable enough...
You can give a group of the 100 most intelligent 5-year-olds all the money you want... They will not produce you a rocket... Well, take maybe 12-year-olds... maybe...
Llama3 8b is good (rather ok/sub-ok), but still not enough...
I hope I can finally try llama3 70b...
social_tech_10@reddit
You can try Llama3-70B for free at [https://duck.ai]
Didi_Midi@reddit
[This] is a very good summary of how quants work in GGUF.
CarpenterHopeful2898@reddit
good article
papipapi419@reddit
Even I was thinking the same thing
CortaCircuit@reddit
How does it compare to deepseek coder and lamma3 for coding?
ihaag@reddit
Deepseek coder V2 is much better for coding; it actually gave correct code where Gemma had so many mistakes
CortaCircuit@reddit
I have been using deepseek coder V1. Unfortunately, my GPU doesn't have enough VRAM for V2.
ihaag@reddit
The gguf version is okay. I used Q4; it doesn't seem as accurate, but it's still much better than any of the others I've used.
Seromelhor@reddit
To be honest, Deepseek (api) has the same quality as Gpt-4 for me in 97% of situations. Costing much less. (0.28 vs 15)
ab_drider@reddit
Does it generate dirty things for role play?
ZaggyChum@reddit
No, only black space nazi stories I'm afraid.
Biggest_Cans@reddit
We'll have to make do
Cyber-exe@reddit
Better off getting a server sleeve of P40's and running Grok
Biggest_Cans@reddit
Haven't been able to get any GGUFs to run in ooba or koboldcpp, so I dunno
mahadevbhakti@reddit
How's it for function calling, anyone tried?
sammcj@reddit
8K context = dead on arrival.
crazzydriver77@reddit
Can someone explain to me the hype around one of the many probability parrots??
Discordpeople@reddit (OP)
These advancements aren't just about bragging rights. They enable real-world applications like offline real-time language translation and tools that help students understand complex questions by providing insightful answers, not just canned responses. AI is constantly evolving; LLMs may not be able to solve all of our problems yet, but they represent a powerful tool with high potential. Don't simply dismiss them as probability parrots, they are really useful.
Amgadoz@reddit
You are in the wrong sub.
papipapi419@reddit
Llama 3 used to fail on structured output tasks (rolling window): given a document, produce a JSON output section-wise (section: section_content). Paid APIs were getting too expensive. Let's hope this does it accurately enough.
Eisenstein@reddit
For structured output, try SFR-Iterative, Codestral, Phi-3, or Qwen2. In my testing they all could output structured json if asked.
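For section-wise tasks like the one described above, a quick validation pass catches malformed model output before it reaches downstream code. A hedged sketch; the key names are illustrative, loosely based on the comment's section/section_content format:

```python
import json

def validate_sections(raw: str, required=("section", "section_content")) -> bool:
    """Check that the model returned a JSON list of objects, each
    carrying the expected keys (key names here are illustrative)."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(items, list) and all(
        isinstance(obj, dict) and set(required) <= set(obj)
        for obj in items
    )

good = '[{"section": "Intro", "section_content": "..."}]'
print(validate_sections(good))  # → True
```

Failing fast here lets you re-prompt the model instead of silently storing broken records.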
Specialist-2193@reddit
It is doing surprisingly well on multilingual tasks.
Western_Soil_4613@reddit
which?
F_Kal@reddit
I was particularly impressed with Greek - I think it's the first model that performs well
Specialist-2193@reddit
Non Latin languages
vasileer@reddit
I don't think you know all of them, so which ones have you tested?
PavelPivovarov@reddit
Tested in Russian and Japanese (to the best of my knowledge) seems also coherent. Saw confirmation about Slovenian (seems like Slavic languages in general are not a problem). Plus someone confirmed Uzbek...
LLMtwink@reddit
i tried ukrainian, can confirm🙋
privacyparachute@reddit
I've been using the 27B all day to translate JSON from English to Dutch. It was mostly fine, with a few very rare typos.
Why not great? It doesn't rewrite sentences to flow better, like some bigger models. You can kind of still see the grammar from the original language having an influence in how the sentence is ordered. But it's not invalid. It's readable. It's.. fine.
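The translate-the-values-but-keep-the-structure task described above boils down to a recursive walk over the JSON tree, applying the model only to string leaves. A minimal sketch, where `str.upper` stands in for the actual translation call:

```python
def map_strings(node, fn):
    """Recursively apply fn to every string value in a JSON-like
    structure, leaving keys, nesting, and non-string values intact.
    In the real task, fn would call the model to translate the text.
    """
    if isinstance(node, dict):
        return {k: map_strings(v, fn) for k, v in node.items()}
    if isinstance(node, list):
        return [map_strings(v, fn) for v in node]
    return fn(node) if isinstance(node, str) else node

doc = {"title": "Hello", "tags": ["world"], "count": 3}
print(map_strings(doc, str.upper))
# → {'title': 'HELLO', 'tags': ['WORLD'], 'count': 3}
```

Because the structure is rebuilt mechanically, the model can never introduce the kind of JSON syntax errors it would if asked to emit the whole document.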
noneabove1182@reddit
Anyone know if the ollama models are updated to the fixed versions? i see that it was updated 10 hours ago but they didn't update their conversion code in the repo so unsure
/u/agntdrake ?
agntdrake@reddit
Do you know what got fixed in the other versions? We have our own conversion code for Ollama separate from Google/llama.cpp's scripts.
noneabove1182@reddit
the biggest change was setting the vocab using _set_vocab_llama_hf instead of _set_vocab_sentencepiece, that seems to have fixed the tokenization of special tokens
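As a rough illustration of why that vocab change matters (a toy tokenizer, not llama.cpp's actual code): when special tokens aren't registered in the vocab, control markers get shredded into ordinary subword pieces and the model never sees its turn delimiters as single tokens:

```python
import re

SPECIAL = ["<start_of_turn>", "<end_of_turn>"]

def tokenize(text: str, specials_known: bool):
    """Toy tokenizer: if the special tokens are registered, split
    them out first and keep each as one token; otherwise the whole
    string falls through to a crude word-level split."""
    if specials_known:
        pattern = "(" + "|".join(map(re.escape, SPECIAL)) + ")"
        parts = [p for p in re.split(pattern, text) if p]
    else:
        parts = [text]
    out = []
    for p in parts:
        if p in SPECIAL:
            out.append(p)  # kept as one control token
        else:
            out.extend(re.findall(r"<|>|\w+|[^\w<>]+", p))
    return out

msg = "<start_of_turn>user hello<end_of_turn>"
print(len(tokenize(msg, specials_known=True)))   # → 5
print(len(tokenize(msg, specials_known=False)))  # → 9
```

In the broken case the angle brackets and token names become separate fragments, which is exactly the kind of subtle corruption that shows up as "incoherent" output rather than an outright error.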
agntdrake@reddit
We just finished pushing some updates to the models if you want to re-pull.
de4dee@reddit
Having no system prompt is not good. It doesn't properly follow my system prompt when I give it as a user request..
It seems to love markdown format.. Lots of bolds in the reply.
PavelPivovarov@reddit
Yeah, and my LLM telegram bot also constantly throws markdown parsing errors on the gemma2 generated responses. Sweet!
LoSboccacc@reddit
tried it but I think something is wrong currently with quantized clients because it becomes incoherent quite fast. for the things it has written it seemed quite good, albeit it's one of those models that insists on emojis so bleh.
candre23@reddit
If you're using llama.cpp (or something based on it), you're going to have a bad time until they implement SWA.
LLMtwink@reddit
it's awesome in multilingual, in particular Ukrainian, but I've seen other people claim it's great in other languages too. gemma 9b is so much better than llama 3 70b in Ukrainian it's not even funny
StableLlama@reddit
Where are uncensored versions that are working well?
With a simple test I could trigger censorship easily, so I won't waste my time with it before that's fixed.
Alexandratang@reddit
I was initially really excited to run these models and see their capabilities for myself, based on what benchmarks were available at the time. Based on my own testing, however, I have not found the models to be impressive at all. I would even go as far as to call them a letdown, unfortunately. It's very unlikely that I will use them over any other freely available models at this point.
ambient_temp_xeno@reddit
It's local llm tradition for things to be broken for at least a day, and this seems to have been the case again.
Barry_22@reddit
Is it multilingual like llama or English-only?
neosinan@reddit
Yes, they are multilingual; I've seen many tests with more obscure languages with good results, not just the other big languages.
Plus_Complaint6157@reddit
multi
grigio@reddit
I tried the 9B q4_k_m with Ollama. It's good, but sometimes it has problems with long responses: it writes many new lines and never ends the full response.
Dry_Parfait2606@reddit
I found this... There is probably a problem when compressing LLMs
----_____---------@reddit
9b is legit. It's not necessarily better than llama 8b, but it feels broadly in the same category, which means that they are both impressive.
I don't get 27b though. It feels kind of the same as 9b? Often even felt worse when I was comparing them. I don't think it's the reported issues with quants, I was using them on lmsys, and I checked that one prompt with temperature zero gave the same answer as on google's AI studio. Anyway, it feels a bit disappointing. I'm not super familiar with llama 70b, but in a few comparisons it felt better.
I've found one case where 27b is better, at least. When I gave it a task to generate some boilerplate code based on a type, the result was a lot less goofy than from 9b and llama 8b.
Lightninghyped@reddit
Gemma2 27B is surprisingly good at multilingual tasks. At least for Korean, which was always bad on open source models. The text it generates is not only grammatically correct, it also shows great semantic understanding. The ability to understand the user's request is also outstanding. If this model is further tuned on more data and given a larger context size, it will be the best Korean open source model.
dimsumham@reddit
Agree. These models are cracked.