What are your thoughts on Gemma2 27B and 9B?
Posted by Discordpeople@reddit | LocalLLaMA | View on Reddit | 112 comments
Last time, I tried Gemma 1.1 7B and the results are really terrible. Now with Gemma2 9B, not only did I get results that are on par with Llama3 8B, sometimes even better than it. Gemma2 27B performs really well in my experience for its size, it can replace Llama3 70B for a lot of stuff. That's just my opinion, I wanna hear your opinions on the new Google models. Does it perform well on your side?
Links to the models:
Gemma2 9B IT, Gemma2 27B IT, Gemma2 9B Base, Gemma2 27B Base
brown2green@reddit
Gemma-2-27B-it doesn't work correctly for me, and I tried downloading it from the Google repository and converting it myself to GGUF. Comparing it to the 9B model, I'd even say it is broken in some way. Gemma-2-9B-it on the other hand almost appears to be an uncensored model, often surprisingly so (definitely less censored than Llama-3-8B-Instruct).
I haven't tried them in all scenarios yet though, only typical multi-turn RP.
Straight-Habit-919@reddit
"napisałem" bardzo fajną książkę na modelu gemma 2 27b, super sie czyta, polecam
noneabove1182@reddit
to that end: I personally updated transformers to the wheel they provided before doing the conversion since that package is used with LlamaHfVocab, but I wasn't sure if it was needed.. did you do the same or did you use the released transformers?
brown2green@reddit
I just pulled the latest llama.cpp changes, made a clean build and quantized the model as usual. After checking, it's on transformers 4.40, for what it's worth. Anyway, it's unclear why only the 27B version would be so negatively affected.
noneabove1182@reddit
yeah so my only concern is that it's possible that conversion would not be perfect if you aren't on their transformers wheels, but i have no proof of that, only a theory and trying to minimize variables
_sqrkl@reddit
Per this pr fix waiting to be merged, you will need to re-generate ggufs.
noneabove1182@reddit
Not strictly because they will have a default value, but I intend to do it anyways because presumably my imatrix isn't as good as it could be
brown2green@reddit
I tried again after updating to transformers 4.42.1 (the same one suggested in the latest HF blogpost), forcing --outtype bf16 in the initial conversion before quantization, and it gave the same results.
noneabove1182@reddit
Well dam 😂 I appreciate your thorough investigation. I wonder if it's the logit soft cap thing then
brown2green@reddit
I think it might have fixed the issues, or at least I'm not getting obviously incoherent responses anymore.
PavelPivovarov@reddit
Speaking of uncensored: gemma2 was the first model to refuse to provide me Python code using a Cookie header for authentication because "it's not recommended practice"... Haven't tested it with RP yet, but for me it didn't look more uncensored than llama3.
brown2green@reddit
Try with some prefill or a "system" prompt (even if officially the model doesn't support that) rather than zero-shotting your questions. For RP, I've tried a limited set of "outrageous" scenarios and it works, whereas Llama-3-Instruct would refuse harshly.
fallingdowndizzyvr@reddit
Try downloading the bartowski GGUF. The first version of 27B he posted yesterday was wacky. Like really wacky. But the updated version he posted last night works much much better.
mrjackspade@reddit
Did you pull the tokenizer update for it? Apparently it was broken so the GGUFs are borked
brown2green@reddit
Yes, the special tokens were being recognized (checked out via verbose llamacpp-server output).
mikael110@reddit
I've had the exact same experience, and I've seen a number of other people say the same thing. I downloaded the model after the tokenizer fix and yet it's just completely broken in the test I usually do (which is a data extraction task). And when I say broken I mean that it doesn't even manage to generate the right keys in the JSON it outputs, which the 9B model does flawlessly every time.
There are issues raised in various places about this like this llama.cpp issue, and this issue in the official Gemma repo. In the latter issue Google states that they are currently investigating this issue, so hopefully it gets fixed soon.
PopularPrivacyPeople@reddit
I can run the 9b in an acceptable way (2-3 tps) on my 16GB RAM AMD Ryzen 7 with AMD Radeon integrated graphics (unused, it only slows things down). I can get the Q3 XXS quant of the 27b to stagger along at 1 tps max, but the content is poor and confused; not worth it for me.
Gemma2 9b and Kunoichi 2DPO are still the best available for my use-case (story writing)
feelosofee@reddit
fps...
Difficult-Slice-5747@reddit
I am quite late to the party, but running on an RTX 4090, 64GB of system RAM and a 16-core/32-thread AMD CPU, I am getting the perfect combination of accuracy and speedy responses. I run it in the base program Brianna for ease of use and text to speech. It can write short stories about two pages long in less than 30 seconds and then read them to me. It cannot do five digit multiplication accurately. I tried for example 654 times 98127 and although it got really close, it could not finish any five digit multiplication accurately. This will be my new personal AI for the foreseeable future. "I'm running out of hardware resources at any scale higher than about 30b"
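Multiplication like this is trivial to verify outside the model, which makes it a handy sanity check for arithmetic claims (the operands below are taken from the comment above; the helper is just an illustrative sketch):

```python
def check_product(a: int, b: int, llm_answer: int) -> bool:
    """Return True if the model's answer matches exact integer math.

    Python ints are arbitrary-precision, so this is always exact.
    """
    return a * b == llm_answer

# The example from the comment above:
exact = 654 * 98127
print(exact)  # → 64175058
```

Comparing the model's reply against `exact` immediately shows how close "really close" actually is.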
felippelazarbr@reddit
I have been using LLMs to understand and classify abstracts according to selection criteria (to simulate a systematic review), and I particularly found Gemma2 9B way better than Llama 3 and Llama 3.1 (8B versions for both). Has anyone had the same impression?
_sqrkl@reddit
Added gemma-2-9b-it to the creative writing leaderboard: https://eqbench.com/creative_writing.html
It's killing it.
I'm having the same issues as others with the 27b version. Hopefully will be ironed out soon.
cyan2k@reddit
But even with the issues it's pretty clear that gemma 27b is a beast. afaik the fix is already on its way. and I personally can't wait.
For once google actually delivered. what a time to be alive.
ILoveThisPlace@reddit
Jesus, we'll see if my tune changes. Guess I gotta test this thing. I've been highly critical of Google. 27B is around the perfect size for a 24GB VRAM consumer card (I haven't looked, just a guess) but models in this range will become increasingly more important.
raika11182@reddit
Now that there's a version that's had refusals tuned out of it (Big Tiger Gemma) I gave it a shot and I'm pretty impressed. I see pros and cons vs Llama 3 70B in my use, so I don't know that you can really say it's smarter, but it's definitely competitive and it is far, far better at writing prose.
ILoveThisPlace@reddit
Hmm good to know. Is it multi-modal?
raika11182@reddit
No, unfortunately. But neither is Llama 3 70B (yet - we've been promised that later).
AssociateDeep2331@reddit
I'm having a problem with gemma-2-9b-it
I say "write a story about. do it one chapter at a time. write chapter 1".
It writes chapter 1 just fine. Then I say "write chapter 2" and it acts totally confused, it says stuff like "please provide me with chapter 1".
What am I doing wrong? Llama3 handles this type of prompt just fine
_sqrkl@reddit
I didn't actually test multi-turn. What are you using for inference? Which version/quant are you using?
Interpause@reddit
sopho released a llama 3 70b merge called new dawn. supposedly on par with but different from midnight miqu. would it be possible to test? thanks
mikael110@reddit
A fix has just been merged into Transformers, so if you update to the latest version the 27b model should behave properly.
oobabooga4@reddit
I think that 27b is still broken in transformers 4.42.3. 27b in bfloat16 precision performs worse than 9b in my benchmark.
capivaraMaster@reddit
Same here.
_sqrkl@reddit
Yep! Seems to be fixed now. I've added the scores to the eq-bench leaderboard, will run the creative writing benchmark overnight.
toothpastespiders@reddit
Awesome to hear! I've been holding out on trying it until I saw someone with problems confirm a fix actually worked.
_sqrkl@reddit
It's still not fixed in llama.cpp afaik. This branch is where they're working on it if anyone wants to keep an eye on progress.
brahh85@reddit
can you add qwen 2 72b? Maybe I'm seeing ghosts, but testing gemma 2 9b I was feeling the 9b was on par or better for RP.
fervoredweb@reddit
I might misunderstand how the board sample outputs next to the model scores work, but a large number of the models have a weirdly similar opening to the first romance story prompt (bells twinkling as a door opens).
_sqrkl@reddit
Some other oddities:
None of these are included in the prompt (although the latter two at least are a reasonably inferred starting point). I guess they are just points of natural convergence of probable tokens for language models.
AnOnlineHandle@reddit
Wow those examples are impressive.
I started skipping the prompts to see if the writing was more impressive going in blind like you normally would to stories, and the 'Epistolary Apocalyptic Survival' one was actually moving.
Feztopia@reddit
Wow, 9b is between Sonnet 3.5 and GPT-4, and Sonnet is the judge, that's cool.
_sqrkl@reddit
I suspect they used (at least some of) their gemini dataset to train these models. The latest gemini pro is fantastic for creative writing imo.
Saifl@reddit
Gemma with fine tuning might be better? And it'll end up cheaper too per token or locally (hopefully)
Feztopia@reddit
I'm curious how good the base model is; I think this time it's the finetune which does all the heavy lifting.
thereisonlythedance@reddit
It’s the model I was hoping for based on how good Gemini Pro 1.5 is at writing tasks. Feels like a mini version of it.
jollizee@reddit
Whoa, that's crazy for a 9b model. Love your v2 updates as well.
ipechman@reddit
not so much on magi-hard lol
datavisualist@reddit
I can easily say that Gemma2 9b is better at multilingual translations than Llama3 8b.
Tim-Fra@reddit
I prefer mixtral and mistral v3... Gemma2 9b & 27b don't support RAG on my openui / ollama server. Llama3 8b is lower than mistral v3... for me
AlexanderKozhevin@reddit
Is there any good jupyter notebook for gemma2 fine-tuning?
sdujack2012@reddit
I tried Gemma2 27B and 9B today, but they didn't impress me. I use LLMs to generate image prompts for every 2-3 sentences of a story and output a JSON array. Both Gemma2 models had issues: syntax errors in the JSON array or blank image prompts. I switched back to Llama3, which works perfectly.
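The JSON failure mode described above (syntax errors, wrapping prose) is easy to guard against with a tolerant parser. A minimal sketch; the helper name and sample reply are made up for illustration:

```python
import json

def extract_json_array(text: str):
    """Pull the first JSON array out of raw LLM output.

    Models sometimes wrap JSON in prose or code fences, so we slice
    from the first '[' to the last ']' before parsing. Returns None
    when no parseable array is found, so the caller can retry.
    """
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None

reply = 'Sure! Here are the prompts:\n["a misty forest", "a ruined tower"]'
print(extract_json_array(reply))  # → ['a misty forest', 'a ruined tower']
```

Falling back to `None` (and re-prompting) beats crashing on the model's occasional stray comma.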
Motor_Ocelot_1547@reddit
in Korean task.. 9b is better
a_beautiful_rhind@reddit
If something sounds too good to be true, it probably is. Maybe they are good riddlers but the 27b on hugging-chat was nothing to write home about. Beating claude-3 sonnet my ass.
large_diet_pepsi@reddit
With a VC money boost, there can be things that are too good to be true
reissbaker@reddit
I'm pretty sure Google, a public company since 2004, isn't taking VC money in 2024 ;)
large_diet_pepsi@reddit
You're right, Google, as a major public company, doesn't rely on VC money. What I meant was that Google's substantial financial resources can sometimes lead to innovation that seems "too good to be true." With their deep pockets, they're often able to push the boundaries of technology further and faster than many smaller players.
Additionally, teams within FAANG companies have the advantage of being fault-tolerant and well-prepared for innovation due to their resources and experience. However, I also believe that the open-source community will continue to make significant strides and eventually catch up.
SatoshiNotMe@reddit
I tried Gemma2-27b via Ollama on a multi-agent + tool workflow script with Langroid,
https://github.com/langroid/langroid/blob/main/examples/basic/chat-search-assistant.py
and it didn’t do well at all.
With Llama3-70b it works perfectly. I wrote about it here :
https://www.reddit.com/r/LocalLLaMA/s/wLgJ07X02Z
raiffuvar@reddit
Can it run on cpu? Still Haven't figured out how to calculate VRAM etc.
Eliiasv@reddit
For GGUF and similar quants there's not much to figure out:
model-file-size.gguf x 1.2 = vRAM requirements.
This is quite accurate; for most intents and purposes, I do 1.25 to account for large ctx window.
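The rule of thumb above, written out as a quick script (the helper name is hypothetical; 1.2 is the multiplier from the comment, 1.25 for large context windows):

```python
import os

def estimate_vram_gb(gguf_path_or_size, overhead: float = 1.2) -> float:
    """Estimate VRAM needed for a GGUF model: file size x 1.2.

    Accepts either a path to a .gguf file or a raw size in bytes.
    Bump overhead to 1.25 to account for a large context window.
    """
    if isinstance(gguf_path_or_size, str):
        size = os.path.getsize(gguf_path_or_size)
    else:
        size = gguf_path_or_size
    return size / 1024**3 * overhead

# A ~5.4 GiB 9B Q4 file would need roughly 6.5 GiB of VRAM:
print(round(estimate_vram_gb(5.4 * 1024**3), 1))  # → 6.5
```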
EmilPi@reddit
I did my favorite logical reasoning test: first I asked
"Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?" - both models answered correct.
Then I rephrased the riddle. Gemma 9B Q8 quant still answered, Gemma 27B Q6 quant failed.
I used LMStudio for testing.
ASYMT0TIC@reddit
Is it possible that these models simply respond much worse to quantization than other models do due to something unique about their architecture or training?
Dry_Parfait2606@reddit
They are designed to use the full fp16... They are probably trying to saturate fp16 to its fullest
harrro@reddit
4 bit doesn't mean 15 variables in this case because there is a multiplier applied over each 4 bit int that tries to get the precision back up to fp16ish levels at least.
I'm not an expert on quant and don't know the full details but the above is a massively simplified version of what happens with quantization.
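That simplified picture can be made concrete with a toy block-quantization sketch: each block of weights is stored as small integers plus one scale factor that maps them back toward fp16-ish values. This is illustrative only; real GGUF quant formats are considerably more involved:

```python
import numpy as np

def quantize_q4(block: np.ndarray):
    """Toy 4-bit block quantization: int values in [-7, 7] plus one
    fp16 scale per block (a massively simplified illustration)."""
    scale = np.abs(block).max() / 7.0
    q = np.round(block / scale).astype(np.int8)
    return q, np.float16(scale)

def dequantize_q4(q: np.ndarray, scale) -> np.ndarray:
    """Multiply the stored ints by the scale to recover approximate weights."""
    return q.astype(np.float32) * np.float32(scale)

rng = np.random.default_rng(0)
w = rng.standard_normal(32).astype(np.float32)  # one block of 32 weights
q, s = quantize_q4(w)
err = np.abs(w - dequantize_q4(q, s)).max()
print(err)  # small but nonzero: precision traded for memory
```

The maximum error per weight is about half the scale, which is exactly the "precision gets lost" trade-off discussed below.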
Dry_Parfait2606@reddit
But the range gets worse when quantized; you trade precision for memory economics. That's why it's fp16ish. You compress data and precision gets lost.
I know when LLMs began to work well (ChatGPT 3/3.5), and I will point out that anything less than that level will not be good enough for decent performance...
You can have the smartest mouse on this planet, but network size can be a good predictor...
Just take Reddit. It's a network of, say, many people in a community... What if you would only allow every 4th person that wants to say something on a post to actually answer? The quality and capacity of the discussion, truth seeking, and problem solving would be compromised, degraded.
Let's go into deep information theory.
Now: you have a request, you do a prompt... And there is only one true perfect answer that can be generated out of the training data of the LLM. If that LLM were infinite in its size, you would get that answer... But you have energy and memory limitations... The more you have to fit that thing into a smaller box, the more compromised you would be... You put pressure on the system...
But now not only do you have physical limitations for an LLM, you also train it, FIRST, to fit into a small network AND SECONDLY, after training, you take that result, which is a representation of the training data, and you degrade it...
I can with confidence, experimentally, convey that LLMs began to work at the ChatGPT 3/3.5 level... Before that they were not reliable enough...
You can give a group of the 100 most intelligent 5-year-olds all the money you want... They will not produce you a rocket... Well, take maybe 12-year-olds... maybe...
Llama3 8b is good (rather ok/sub-ok), but still not enough...
I hope I can finally try llama3 70b...
social_tech_10@reddit
You can try Llama3-70B for free at [https://duck.ai]
Didi_Midi@reddit
[This] is a very good summary of how quants work in GGUF.
CarpenterHopeful2898@reddit
good article
papipapi419@reddit
Even I was thinking the same thing
CortaCircuit@reddit
How does it compare to deepseek coder and lamma3 for coding?
ihaag@reddit
Deepseek coder V2 is much better for coding; it actually gave correct code where Gemma had so many mistakes
CortaCircuit@reddit
I have been using deepseek coder V1. Unfortunately, my GPU doesn't have enough VRAM for V2.
ihaag@reddit
The gguf version is okay. I used Q4; it doesn't seem as accurate, but it's still much better than any of the others I've used.
Seromelhor@reddit
To be honest, Deepseek (api) has the same quality as Gpt-4 for me in 97% of situations. Costing much less. (0.28 vs 15)
ab_drider@reddit
Does it generate dirty things for role play?
ZaggyChum@reddit
No, only black space nazi stories I'm afraid.
Biggest_Cans@reddit
We'll have to make do
Cyber-exe@reddit
Better off getting a server sleeve of P40's and running Grok
Biggest_Cans@reddit
Haven't been able to get any GGUFs to run in ooba or koboldcpp, so I dunno
mahadevbhakti@reddit
How's it for function calling, anyone tried?
sammcj@reddit
8K context = dead on arrival.
crazzydriver77@reddit
Can someone explain to me the hype around one of the many probability parrots??
Discordpeople@reddit (OP)
These advancements aren't just about bragging rights. They enable real-world applications like offline real-time language translation and tools that help students understand complex questions by providing insightful answers, not just canned responses. AI is constantly evolving; LLMs may not be able to solve all of our problems yet, but they represent a powerful tool with high potential. Don't simply dismiss them as probability parrots, they are really useful.
Amgadoz@reddit
You are in the wrong sub.
papipapi419@reddit
Llama 3 used to fail on structured output tasks (rolling window): given a document, produce a JSON output section-wise (section: section_content). Paid APIs were getting too expensive. Let's hope this does it accurately enough.
Eisenstein@reddit
For structured output, try SFR-Iterative, Codestral, Phi-3, or Qwen2. In my testing they all could output structured json if asked.
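For section-wise tasks like the one described above, a quick validation pass catches malformed model output before it reaches downstream code. A hedged sketch; the key names are illustrative, loosely based on the comment's section/section_content format:

```python
import json

def validate_sections(raw: str, required=("section", "section_content")) -> bool:
    """Check that the model returned a JSON list of objects, each
    carrying the expected keys (key names here are illustrative)."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(items, list) and all(
        isinstance(obj, dict) and set(required) <= set(obj)
        for obj in items
    )

good = '[{"section": "Intro", "section_content": "..."}]'
print(validate_sections(good))  # → True
```

Failing fast here lets you re-prompt the model instead of silently storing broken records.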
Specialist-2193@reddit
It is doing surprisingly well on multilingual tasks.
Western_Soil_4613@reddit
which?
F_Kal@reddit
I was particularly impressed with Greek - I think it's the first model that performs well
Specialist-2193@reddit
Non Latin languages
vasileer@reddit
I don't think you know all of them, so which ones have you tested?
PavelPivovarov@reddit
Tested in Russian and Japanese (to the best of my knowledge) seems also coherent. Saw confirmation about Slovenian (seems like Slavic languages in general are not a problem). Plus someone confirmed Uzbek...
LLMtwink@reddit
i tried ukrainian, can confirm🙋
privacyparachute@reddit
I've been using the 27B all day to translate JSON from English to Dutch. It was mostly fine, with a few very rare typos.
Why not great? It doesn't rewrite sentences to flow better, like some bigger models. You can kind of still see the grammar from the original language having an influence in how the sentence is ordered. But it's not invalid. It's readable. It's.. fine.
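The translate-the-values-but-keep-the-structure task described above boils down to a recursive walk over the JSON tree, applying the model only to string leaves. A minimal sketch, where `str.upper` stands in for the actual translation call:

```python
def map_strings(node, fn):
    """Recursively apply fn to every string value in a JSON-like
    structure, leaving keys, nesting, and non-string values intact.
    In the real task, fn would call the model to translate the text.
    """
    if isinstance(node, dict):
        return {k: map_strings(v, fn) for k, v in node.items()}
    if isinstance(node, list):
        return [map_strings(v, fn) for v in node]
    return fn(node) if isinstance(node, str) else node

doc = {"title": "Hello", "tags": ["world"], "count": 3}
print(map_strings(doc, str.upper))
# → {'title': 'HELLO', 'tags': ['WORLD'], 'count': 3}
```

Because the structure is rebuilt mechanically, the model can never introduce the kind of JSON syntax errors it would if asked to emit the whole document.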
noneabove1182@reddit
Anyone know if the ollama models are updated to the fixed versions? i see that it was updated 10 hours ago but they didn't update their conversion code in the repo so unsure
/u/agntdrake ?
agntdrake@reddit
Do you know what got fixed in the other versions? We have our own conversion code for Ollama separate from Google/llama.cpp's scripts.
noneabove1182@reddit
the biggest change was setting the vocab using _set_vocab_llama_hf instead of _set_vocab_sentencepiece, that seems to have fixed the tokenization of special tokens
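As a rough illustration of why that vocab change matters (a toy tokenizer, not llama.cpp's actual code): when special tokens aren't registered in the vocab, control markers get shredded into ordinary subword pieces and the model never sees its turn delimiters as single tokens:

```python
import re

SPECIAL = ["<start_of_turn>", "<end_of_turn>"]

def tokenize(text: str, specials_known: bool):
    """Toy tokenizer: if the special tokens are registered, split
    them out first and keep each as one token; otherwise the whole
    string falls through to a crude word-level split."""
    if specials_known:
        pattern = "(" + "|".join(map(re.escape, SPECIAL)) + ")"
        parts = [p for p in re.split(pattern, text) if p]
    else:
        parts = [text]
    out = []
    for p in parts:
        if p in SPECIAL:
            out.append(p)  # kept as one control token
        else:
            out.extend(re.findall(r"<|>|\w+|[^\w<>]+", p))
    return out

msg = "<start_of_turn>user hello<end_of_turn>"
print(len(tokenize(msg, specials_known=True)))   # → 5
print(len(tokenize(msg, specials_known=False)))  # → 9
```

In the broken case the angle brackets and token names become separate fragments, which is exactly the kind of subtle corruption that shows up as "incoherent" output rather than an outright error.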
agntdrake@reddit
We just finished pushing some updates to the models if you want to re-pull.
de4dee@reddit
Having no system prompt is not good. It doesn't properly follow my system prompt when I give it as a user request..
It seems to love markdown format.. Lots of bolds in the reply.
PavelPivovarov@reddit
Yeah, and my LLM telegram bot also constantly throws markdown parsing errors on the gemma2 generated responses. Sweet!
LoSboccacc@reddit
tried it but I think something is wrong currently with quantized clients because it becomes incoherent quite fast. for the things it has written it seemed quite good, albeit it's one of those models that insists on emojis so bleh.
candre23@reddit
If you're using llama.cpp (or something based on it), you're going to have a bad time until they implement SWA.
LLMtwink@reddit
it's awesome in multilingual, in particular Ukrainian, but I've seen other people claim it's great in other languages too. gemma 9b is so much better than llama 3 70b in Ukrainian it's not even funny
StableLlama@reddit
Where are uncensored versions that are working well?
With a simple test I could trigger censorship easily, so I won't waste my time with it before that's fixed.
Alexandratang@reddit
I was initially really excited to run these models and see their capabilities for myself, based on what benchmarks were available at the time. Based on my own testing, however, I have not found the models to be impressive at all. I would even go as far as to call them a letdown, unfortunately. It's very unlikely that I will use them over any other freely available models at this point.
ambient_temp_xeno@reddit
It's local llm tradition for things to be broken for at least a day, and this seems to have been the case again.
Barry_22@reddit
Is it multilingual like llama or English-only?
neosinan@reddit
Yes, they are multilingual; I've seen many tests with more obscure languages with good results, not just the other big languages.
Plus_Complaint6157@reddit
multi
grigio@reddit
I tried the 9B q4_k_m with Ollama. It's good, but sometimes it has problems with long responses: it writes many new lines and never ends the full response.
Dry_Parfait2606@reddit
I found this... There is probably a problem when compressing LLMs
----_____---------@reddit
9b is legit. It's not necessarily better than llama 8b, but it feels broadly in the same category, which means that they are both impressive.
I don't get 27b though. It feels kind of the same as 9b? Often even felt worse when I was comparing them. I don't think it's the reported issues with quants, I was using them on lmsys, and I checked that one prompt with temperature zero gave the same answer as on google's AI studio. Anyway, it feels a bit disappointing. I'm not super familiar with llama 70b, but in a few comparisons it felt better.
I've found one case where 27b is better, at least. When I gave it a task to generate some boilerplate code based on a type, the result was a lot less goofy than from 9b and llama 8b.
Lightninghyped@reddit
Gemma2 27B is surprisingly good at multilingual tasks. At least for Korean, which was always bad on open source models. The text it generates is not only grammatically correct, it also shows great semantic understanding. The ability to understand the user's request is also outstanding. If this model is further tuned on more data and given a larger context size, it will be the best Korean open source model.
dimsumham@reddit
Agree. These models are cracked.