local-gemma: Gemma 2 optimized for your local machine

Posted by hackerllama@reddit | LocalLLaMA

Hey all! This is Omar, Chief Llama Officer at Hugging Face, ready to talk about our latest project, `local-gemma` (https://github.com/huggingface/local-gemma).

A common piece of feedback we receive in transformers is that picking the right parameters and settings for your use case is not obvious. Hence, we are releasing a first `local-gemma` repo, which hopefully helps patch this up!

* CLI and Python usage
* Automatic preset based on your hardware, trading off between speed, memory, and accuracy
  * **Exact**: maximizes accuracy. 18.3GB for 9B, 68.2GB for 27B.
  * **Memory**: uses 4-bit quantization. 7.3GB for 9B, 17GB for 27B.
  * **Memory Extreme**: uses CPU offloading. 3.7GB for 9B, 4.7GB for 27B.
* Easy to install with pip and pipx
* Works with CUDA, MPS, and CPU
* Uses logit soft-capping, which means you won't get the weird results some folks are getting with the 27B

This is a first experiment to make it easier for folks to run models locally with transformers and get good generation results. Feel free to leave feedback as issues in the repo. Enjoy!
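
To give a feel for the memory/accuracy trade-off behind the presets, here is a minimal sketch that picks the most accurate preset fitting a given memory budget. The function name `pick_preset`, the dictionary keys, and the lowercase preset labels are made up for illustration and are not taken from the `local-gemma` code. Only the GB figures come from the numbers above, and the real selection logic in the repo may well differ.

```python
from typing import Optional

# Approximate memory footprints in GB, copied from the figures quoted in the post.
# The keys and preset labels are illustrative assumptions, not library identifiers.
MEMORY_GB = {
    "9b": {"exact": 18.3, "memory": 7.3, "memory_extreme": 3.7},
    "27b": {"exact": 68.2, "memory": 17.0, "memory_extreme": 4.7},
}

# Ordered from most accurate to most memory-frugal.
PRESET_ORDER = ["exact", "memory", "memory_extreme"]


def pick_preset(model_size: str, available_gb: float) -> Optional[str]:
    """Return the most accurate preset whose footprint fits in available_gb.

    Returns None when even the smallest preset does not fit.
    Hypothetical helper: this is not how local-gemma is known to decide.
    """
    footprints = MEMORY_GB[model_size]
    for preset in PRESET_ORDER:
        if footprints[preset] <= available_gb:
            return preset
    return None


if __name__ == "__main__":
    # Try a few budgets for both model sizes.
    for size, gb in [("9b", 24.0), ("9b", 8.0), ("27b", 24.0), ("27b", 4.0)]:
        print(size, gb, pick_preset(size, gb))
```

Keep in mind that the Memory Extreme figures rely on CPU offloading, so a real check would likely also need to account for available system RAM, which this sketch ignores.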