local-gemma: Gemma 2 optimized for your local machine

Posted by hackerllama@reddit | LocalLLaMA | 37 comments

Hey all! This is Omar, Chief Llama Officer at Hugging Face, ready to talk about our latest project, `local-gemma` (https://github.com/huggingface/local-gemma).

A common piece of feedback we receive in transformers is that picking the right parameters and settings for your use case is not obvious. Hence, we are releasing a first `local-gemma` repo, which hopefully helps patch this up!

* CLI and Python usage
* Automatic preset based on your hardware, trading off between speed, memory, and accuracy
  * **Exact**: maximizes accuracy. 18.3GB for 9B, 68.2GB for 27B.
  * **Memory**: uses 4-bit quantization. 7.3GB for 9B, 17GB for 27B.
  * **Memory Extreme**: uses CPU offloading. 3.7GB for 9B, 4.7GB for 27B.
* Easy to install with pip and pipx
* Works with CUDA, MPS, and CPU
* This uses logit soft-capping, which means you won't get the weird results some folks are getting with the 27B

This is a first experiment to make it easier for folks to run models locally with transformers and get good generation results. Feel free to leave feedback as issues in the repo. Enjoy!
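
To make the preset trade-off concrete, here is a minimal sketch of how a preset could be chosen from the memory figures quoted in the post. The function `pick_preset` and the table `PRESET_MEMORY_GB` are hypothetical names for illustration; this is not the selection logic that `local-gemma` actually uses.

```python
from typing import Optional

# Memory figures (GB) quoted in the post, per model size and preset.
PRESET_MEMORY_GB = {
    "9b": {"exact": 18.3, "memory": 7.3, "memory_extreme": 3.7},
    "27b": {"exact": 68.2, "memory": 17.0, "memory_extreme": 4.7},
}

# Ordered from most accurate to most memory-frugal.
PRESET_ORDER = ["exact", "memory", "memory_extreme"]


def pick_preset(model_size: str, available_gb: float) -> Optional[str]:
    """Hypothetical helper: return the most accurate preset that fits in memory."""
    for preset in PRESET_ORDER:
        if PRESET_MEMORY_GB[model_size][preset] <= available_gb:
            return preset
    return None  # nothing fits, even with CPU offloading


if __name__ == "__main__":
    for size in ("9b", "27b"):
        print(size, "on a 24GB device:", pick_preset(size, 24.0))
```

Under this simple rule a 24GB device would get `exact` for the 9B model and `memory` for the 27B model; a real heuristic would also need headroom for activations and the KV cache.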

37 Comments

[-]

kryptkpr@reddit

This is nice but should we maybe talk about the elephant in the room: why is 9B getting 17 tok/sec on an A100? This is abysmal.

[-]

hackerllama@reddit (OP)

Because the model uses logit soft-capping, SDPA and Flash Attention are not compatible with Gemma 2. torch.compile also does not work out of the box yet. This means that a bunch of the optimizations built into the ecosystem will not work for Gemma 2 for now.
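
For readers unfamiliar with the term, here is a minimal pure-Python sketch of logit soft-capping applied to attention scores. It assumes the usual formulation `cap * tanh(score / cap)`; the cap value of 50.0 and the helper names are assumptions for illustration, not code from transformers. The extra step sits between the score computation and the softmax, which is the part that fused attention kernels typically do not expose.

```python
import math
from typing import List


def soft_cap(score: float, cap: float) -> float:
    """Squash a score smoothly into the open interval (-cap, cap)."""
    return cap * math.tanh(score / cap)


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def capped_attention_weights(scores: List[float], cap: float = 50.0) -> List[float]:
    """Apply soft-capping to raw attention scores, then normalize.

    The cap default is an assumed value for illustration.
    """
    return softmax([soft_cap(s, cap) for s in scores])


if __name__ == "__main__":
    raw = [1.0, 20.0, 400.0]
    print([round(soft_cap(s, 50.0), 3) for s in raw])
    print(capped_attention_weights(raw))
```

Small scores pass through almost unchanged, while very large ones saturate just below the cap.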

[-]

MoffKalast@reddit

Google mentioned that the sliding window was used for "inference speed, while maintaining long context performance"; meanwhile they absolutely demolished all performance with the stupid soft capping lmao.

[-]

jkflying@reddit

That's just due to the engines we have right now not being optimised for it, not a fundamental limitation of the model. I'm sure their internal engines support this optimisation properly, and it will be added to all of the tools in due course.

[-]

MoffKalast@reddit

Well, as long as "in due course" happens before another major model release obsoletes Gemma-2. Otherwise it won't happen at all, because nobody will care anymore.

[-]

kryptkpr@reddit

Ahh so it all comes back to logit soft capping. I would replace my use of L3-8B-Instruct with G2-9B-It in a heartbeat if it wasn't ~1/3rd the speed.

[-]

DeltaSqueezer@reddit

Try vLLM, they remove the soft cap: https://github.com/vllm-project/vllm/pull/5908

[-]

kryptkpr@reddit

I just read over that PR - it says they are "planning to" include the fix, but it looks like current state of 27b is broken AF?

[-]

DeltaSqueezer@reddit

TBH, I never tried any Gemma model.

[-]

kryptkpr@reddit

Sweet, I'll pick this up when they release 0.5.1

[-]

BestSentence4868@reddit

Played around with this for a whole day building some inference containers, and I think it is great, but authentication needs an overhaul, and maybe so does "auto". "auto" mode shouldn't go to "memory" on a GPU with 24GB VRAM for 9B if "exact" is faster. Authentication in code with `token=hf_token` doesn't work unless you use `subprocess.run(["local-gemma", "--token", hf_token, "What is the capital of France"])`. I'd love it if those got fixed.
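
A minimal sketch of the subprocess workaround described above. The `--token` flag and the prompt-as-argument form are taken from this comment; `build_command` and `run_local_gemma` are hypothetical helper names, and the CLI call is only attempted if `local-gemma` is on the PATH.

```python
import shutil
import subprocess
from typing import List


def build_command(hf_token: str, prompt: str) -> List[str]:
    """Build the argument list; subprocess.run expects one list, not separate strings."""
    return ["local-gemma", "--token", hf_token, prompt]


def run_local_gemma(hf_token: str, prompt: str) -> str:
    """Hypothetical wrapper: run the CLI and return whatever it prints to stdout."""
    if shutil.which("local-gemma") is None:
        raise FileNotFoundError("local-gemma CLI not found on PATH")
    result = subprocess.run(
        build_command(hf_token, prompt),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    print(build_command("hf_xxx", "What is the capital of France"))
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces.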

[-]

jgante@reddit

Hey there! Maintainer here :D login should be simpler now, and some 24GB GPUs were incorrectly mapping to the `memory` preset. Try installing and running from `main` (e.g. `pipx install git+https://github.com/huggingface/local-gemma.git`)

[-]

JadeSerpant@reddit

How does this (which uses transformers I guess) compare to llama.cpp in terms of speed? Let's say on an M1 macbook pro.

[-]

ThickLetteread@reddit

How good is llama.cpp on your machine? Is it a standalone locally run setup? I'd like to try it on mine. How much space is needed?

[-]

Robert__Sinclair@reddit

Still, the best way is using llama.cpp, for a few reasons: 1) with llama.cpp I can quantize "my way", which is different from the usual way of quantizing all tensors in the same way; 2) llama.cpp is way faster. If you want to test my quants: [ZeroWw/gemma-2-9b-it-GGUF at main (huggingface.co)](https://huggingface.co/ZeroWw/gemma-2-9b-it-GGUF/tree/main)

[-]

AlphaLemonMint@reddit

llama.cpp does not support bfloat16 inference.

[-]

Robert__Sinclair@reddit

I haven't seen any difference yet between bf16 and f16.

[-]

AlphaLemonMint@reddit

The Gemma 2 27B fully utilizes the resolution of BF16, so FP16 inference will cause problems.
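
One plausible reading of this claim is about dynamic range rather than precision: BF16 keeps the 8 exponent bits of FP32, while FP16 has only 5. The sketch below computes the largest finite value of each format from its bit layout; the helper names are hypothetical, and whether Gemma 2 27B activations actually exceed the FP16 limit is not shown in this thread.

```python
def max_finite(exp_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style binary float with the given layout."""
    bias = 2 ** (exp_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest usable
    # unbiased exponent equals the bias.
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias


FP16_MAX = max_finite(exp_bits=5, mantissa_bits=10)
BF16_MAX = max_finite(exp_bits=8, mantissa_bits=7)


def overflows(value: float, limit: float) -> bool:
    """True if the magnitude cannot be represented as a finite number."""
    return abs(value) > limit


if __name__ == "__main__":
    print("FP16 max:", FP16_MAX)
    print("BF16 max:", BF16_MAX)
    print("1e5 overflows FP16:", overflows(1e5, FP16_MAX))
    print("1e5 overflows BF16:", overflows(1e5, BF16_MAX))
```

Any intermediate value above roughly 65504 becomes infinity in FP16 but stays finite in BF16, at the cost of fewer mantissa bits.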

[-]

Robert__Sinclair@reddit

Like what? I haven't seen any so far, but it's also true I don't use it much, and if I do I use the 7B.

[-]

smcnally@reddit

Your quants are working with recent llama-server builds? Do you have parameter recommendations?

[-]

Winter_Importance436@reddit

No rocm/oneapi support 😢?

[-]

nborwankar@reddit

What is the degradation in inference with memory extreme, and how does one measure it? How does one experience it if we don’t want to do benchmarks but want to see for which use cases it still behaves well enough? Did you conduct any experiments to see what happens?

[-]

jgante@reddit

memory_extreme has nearly the same model quality, according to the benchmarks we ran (on both 9b and 27b models) :) See here: https://github.com/huggingface/local-gemma?tab=readme-ov-file#presets

[-]

a_beautiful_rhind@reddit

68gb for a 27b? What is it in, FP32?

[-]

mikael110@reddit

No, it's in bfloat16. 68GB is normal for a full 27B model. At FP32 it would be significantly larger than that. It's easy to lose track of just how huge LLMs really are when you typically run them quantized.

[-]

hackerllama@reddit (OP)

27 billion parameters in 16 bits = 432 billion bits = 54 billion bytes = 54 gigabytes, just to be able to load the model.
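
The same arithmetic as a small sketch, so it can be repeated for other sizes and precisions. `weight_memory_gb` is a hypothetical helper, and it counts only the weights, using decimal gigabytes; the higher figures in the post presumably include runtime overhead beyond the weights.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights, in decimal gigabytes."""
    total_bits = num_params * bits_per_param
    total_bytes = total_bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for label, bits in (("FP32", 32), ("BF16", 16), ("4-bit", 4)):
        print(f"27B in {label}: {weight_memory_gb(27e9, bits):.1f} GB")
```

At FP32 the same model would need 108 GB for the weights alone, and at 4 bits about 13.5 GB.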
