Sarvam-30b-quantized - Need 1-bit version GGUF
Posted by pmttyji@reddit | LocalLLaMA | 17 comments
I came across this 1-bit version of a 30B model by chance. I remember that some of us wanted to see medium and large 1-bit models. Here is one. Could somebody please create a GGUF of this 1-bit version, so we can run something bigger with small VRAM? Thanks
Overview
This repository contains an ultra-quantized version of the Sarvam-30B model, achieving a 27.6x compression ratio from the original FP16 size (~128.61 GB) to approximately 4.34 GB.
- Original Model: sarvamai/sarvam-30b
- Quantization Method: Custom 1-bit quantization with HQQ (Half-Quadratic Quantization)
- Target Size: <5GB (achieved: 4.34 GB)
- Compression Ratio: 27.6x
Quantization Details
Method
This model uses a custom 1-bit quantization scheme optimized for the Sarvam-30B architecture:
- Weight Quantization: Weights are quantized to 1-bit using a custom binary quantization with learned scales
- Scale Storage: Per-channel scales are stored in FP16 for dequantization
- Expert Routing: MoE routing weights preserved at higher precision for accuracy
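As a rough illustration of the scheme described in the list above, here is a minimal Python sketch of 1-bit weight quantization with one scale per output channel. It is not the repository's code: the card says the scales are learned (with HQQ), whereas this sketch swaps in the mean absolute value of each channel, which is the closed-form least-squares choice for sign quantization. The function names `quantize_row` and `dequantize_row` are made up for the example.

```python
def quantize_row(row):
    """Quantize one output channel to signs plus a single scale."""
    # Assumption: scale = mean absolute value of the channel.
    # The model card describes learned scales instead.
    scale = sum(abs(w) for w in row) / len(row)
    signs = [1 if w >= 0 else -1 for w in row]
    return signs, scale


def dequantize_row(signs, scale):
    """Rebuild approximate weights: each one becomes +scale or -scale."""
    return [s * scale for s in signs]


weights = [
    [0.5, -1.5, 1.0, -1.0],      # channel 0
    [0.25, 0.25, -0.25, 0.25],   # channel 1
]
quantized = [quantize_row(row) for row in weights]
restored = [dequantize_row(signs, scale) for signs, scale in quantized]

for original, approx in zip(weights, restored):
    print(original, "->", approx)
```

Every weight in a channel ends up as either +scale or -scale, so only the sign bit and one scale per channel need to be stored.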
Compression Breakdown
| Component | Original Size | Quantized Size | Compression |
|---|---|---|---|
| Model Weights | ~128.61 GB | ~4.34 GB | 27.6x |
| Total (with metadata) | ~128.61 GB | ~4.65 GB | 27.6x |
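The ratios in the table can be checked with a few lines of arithmetic. Using only the sizes quoted above, dividing by the 4.34 GB weights figure gives roughly 29.6x, while dividing by the 4.65 GB total gives roughly 27.7x, so the stated 27.6x seems to correspond to the total with metadata rather than to the weights alone. This is only a sanity check on the quoted numbers, not a measurement.

```python
original_gb = 128.61  # FP16 size as stated in the model card
weights_gb = 4.34     # quantized weights, as stated
total_gb = 4.65       # quantized weights plus metadata, as stated

ratio_weights = original_gb / weights_gb
ratio_total = original_gb / total_gb

print(f"weights only:  {ratio_weights:.1f}x")
print(f"with metadata: {ratio_total:.1f}x")
```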
Performance Metrics
Compression Achieved
| Metric | Value |
|---|---|
| Original FP16 Size | ~128.61 GB |
| Quantized Size | 4.34 GB |
| Compression Ratio | 27.6x |
| Target (<5GB) | ✓ Achieved |
Inference Performance
- Memory Usage: ~5-6GB VRAM for inference (vs ~60GB for FP16)
- Latency: ~2-3x slower than FP16 due to dequantization overhead
- Throughput: Suitable for batch processing and edge deployment
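For a back-of-the-envelope view of the memory figures above, the sketch below estimates weights-only size from a parameter count and bits per weight. It assumes exactly 30 billion parameters (taken from the model name, not confirmed by the card) and 1 GB = 10^9 bytes, and it ignores activations, KV cache, and scale storage. Under those assumptions FP16 comes to about 60 GB, in line with the figure in the list above, and 4.34 GB works out to roughly 1.16 bits per weight. The ~128.61 GB original size quoted elsewhere in the card does not follow from these assumptions, so either the assumption or that figure deserves some caution.

```python
ASSUMED_PARAMS = 30e9  # assumption: "30B" read literally from the model name
BYTES_PER_GB = 1e9     # assumption: decimal gigabytes


def weights_size_gb(num_params, bits_per_weight):
    """Weights-only size in GB for a given precision."""
    return num_params * bits_per_weight / 8 / BYTES_PER_GB


def effective_bits_per_weight(size_gb, num_params):
    """Average bits stored per parameter for a file of the given size."""
    return size_gb * BYTES_PER_GB * 8 / num_params


fp16_gb = weights_size_gb(ASSUMED_PARAMS, 16)
bits_per_weight = effective_bits_per_weight(4.34, ASSUMED_PARAMS)

print(f"FP16 weights: {fp16_gb:.1f} GB")
print(f"4.34 GB file: {bits_per_weight:.2f} bits per weight")
```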
Quality Metrics
The quantized model maintains near-original performance:
- Perplexity: Within 5-10% of original FP16 model
- BLEU Score: ~95% of original on translation tasks
- Human Evaluation: Output quality rated as "almost similar" to full precision
Limitations
- Custom Format: This is a custom 1-bit quantization format, not standard GGUF or GPTQ
- Dequantization Required: Runtime dequantization adds computational overhead
- Hardware Requirements: Requires CUDA-capable GPU for efficient inference
- Not for Fine-tuning: Quantized weights are not suitable for further training
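To make the dequantization overhead concrete, here is a minimal sketch of how 1-bit weights might be stored and expanded at runtime. The bit layout (least significant bit first, a set bit meaning +1) and the names `pack_signs` and `unpack_and_scale` are assumptions for illustration; the card does not document the repository's actual format.

```python
def pack_signs(signs):
    """Pack +1/-1 signs into bytes, 8 weights per byte (set bit means +1)."""
    packed = bytearray((len(signs) + 7) // 8)
    for i, s in enumerate(signs):
        if s > 0:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_and_scale(packed, count, scale):
    """Runtime dequantization: expand each bit back to +scale or -scale."""
    return [
        scale if (packed[i // 8] >> (i % 8)) & 1 else -scale
        for i in range(count)
    ]


signs = [1, -1, 1, 1, -1, -1, 1, -1, 1, 1]
packed = pack_signs(signs)
weights = unpack_and_scale(packed, len(signs), 0.5)

print(len(signs), "weights stored in", len(packed), "bytes")
print(weights)
```

The unpacking step has to run before (or be fused into) every matrix multiply, which is the extra work the limitation above refers to.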
butlan@reddit
If this quant type promises good results, a llama.cpp implementation should be easy. Normally, though, if this is real 1-bit, it should not output anything meaningful without training. I will check this.
Sufficient-Bid3874@reddit
Why can't you make one? There is a script bundled with llama.cpp
pmttyji@reddit (OP)
I'm not a professional (not even a techie; I only came to llama.cpp from koboldcpp/Jan last year). Also, this is a 1-bit version. I don't want to embarrass myself.
Sufficient-Bid3874@reddit
Here before they realise that it's already quantized and merely needs to be put into a GGUF format
pmttyji@reddit (OP)
You really put a lot of confidence in me, dude. Sorry to disappoint you.
Sufficient-Bid3874@reddit
Sorry if I came across as rough. I just meant that it would be faster to make one yourself, as they already quantised it, so it's really simple. However, u/noctrex is right: it's not supported in llama.cpp right now.
pmttyji@reddit (OP)
Yeah, now this thread has been buried. From now on, no one will know about this 30B 1-bit model.
Sufficient-Bid3874@reddit
I didn't downvote your post btw
pmttyji@reddit (OP)
Unfortunately, some people did that unintentionally. No big deal. And I don't care about karma or downvotes.
Now I'm looking for other ways to run this. I don't have my laptop at the moment (it went out for repair last week with a display issue). Otherwise I would've tried TextGen (oobabooga), which supports safetensors through the Transformers backend. Possibly Jan too. I'll check by this weekend and post a thread if it works.
I have posted more than a few low-effort or less useful threads in this sub over the last year or more, but this thread is not one of those. I really wanted to get a GGUF of this model ASAP.
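One low-effort way to see what a custom-quantized safetensors file actually contains, before trying any loader, is to read its header. A safetensors file starts with an 8-byte little-endian length followed by a JSON header that lists each tensor's dtype, shape, and byte offsets. The sketch below parses that header with the standard library only; the tensor name and the in-memory file it builds are made up for the demo.

```python
import io
import json
import struct


def read_safetensors_header(stream):
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}} from a stream."""
    (header_len,) = struct.unpack("<Q", stream.read(8))
    header = json.loads(stream.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


# Build a tiny fake file in memory (hypothetical tensor name) for the demo.
fake_header = json.dumps(
    {"layer0.weight_packed": {"dtype": "U8", "shape": [4], "data_offsets": [0, 4]}}
).encode("utf-8")
fake_file = io.BytesIO(struct.pack("<Q", len(fake_header)) + fake_header + bytes(4))

for name, info in read_safetensors_header(fake_file).items():
    print(name, info["dtype"], info["shape"])
```

With a real file, pass `open(path, "rb")` instead of the in-memory stream. Packed integer tensors sitting next to separate scale tensors would be one hint that a loader needs custom dequantization code.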
Sufficient-Bid3874@reddit
I would not trust the benchmarks. Why not use a more mainstream model?
pmttyji@reddit (OP)
My current laptop (8GB VRAM) can't run anything big. The new rig is delayed again (I hoped for this month, but it looks like the first half of next month).
I still want to run 1-bit versions of 500B/1T models from this year or next onwards :D That's why I wanted to try 1-bit versions of 30B models first :)
Look_0ver_There@reddit
The GGUF conversion needs to know about the specific model architecture being used before it can convert it. There are plenty of models that it can't convert from safetensors to GGUF. It's only 4GB. Give it a try and report back what happens.
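The point above can be sketched in code. Hugging Face model repos usually ship a `config.json` whose `architectures` field names the model class, and a converter can only proceed if it has an implementation for that name. The check below is a hypothetical illustration, not llama.cpp's actual logic; the architecture names and the `SUPPORTED` set are placeholders.

```python
import json


def first_architecture(config_text):
    """Return the first entry of the "architectures" field, or None."""
    config = json.loads(config_text)
    archs = config.get("architectures") or []
    return archs[0] if archs else None


def can_convert(config_text, supported):
    """True only if the converter knows the declared architecture."""
    arch = first_architecture(config_text)
    return arch is not None and arch in supported


# Placeholder names for illustration only; not a real support list.
SUPPORTED = {"ExampleForCausalLM"}
example_config = '{"architectures": ["CustomOneBitForCausalLM"]}'

print(first_architecture(example_config), can_convert(example_config, SUPPORTED))
```

A custom quantization format adds a second hurdle on top of this: even with a known architecture, the converter would still need to understand how the packed weights are laid out.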
Sufficient-Bid3874@reddit
Fair, I was wrong, but in my defence, I hadn't read the full post at the time. Since they were posting a BitNet-style model, I assumed the arch would already be supported in llama.cpp, and that they wanted this BitNet version because they could not run the full model. A reasonable assumption given the context; it just turned out not to be the case yet.
Healthy-Nebula-3603@reddit
Uhhh so unless...
pmttyji@reddit (OP)
u/noctrex please create GGUF for this if possible. Thanks
noctrex@reddit
That would not be possible for this specific repository. I could quantize the original model in normal quantizations. This specific one would not be supported in llama.cpp. As they say in the text: "Custom Format: This is a custom 1-bit quantization format, not standard GGUF or GPTQ"
pmttyji@reddit (OP)
Oops. Though I noticed that line, I thought they meant typical quants (from Q8 to Q1).
If it's not GGUF/GPTQ, then which inference engines are going to support this model?
I had hoped to try a GGUF of this. My bad *sigh*
Thanks dude