Sarvam-30b-quantized - Need 1-bit version GGUF

Posted by pmttyji@reddit | LocalLLaMA | 17 comments

I randomly came across this 1-bit version of a 30B model. I remember that some of us wanted to see medium/large 1-bit models. Here is one. Could somebody please create a GGUF of this 1-bit version, so we can run something bigger with small VRAM? Thanks

Overview

This repository contains an ultra-quantized version of the Sarvam-30B model, achieving a 27.6x compression ratio from the original FP16 size (~128.61 GB) to approximately 4.34 GB.

Original Model: sarvamai/sarvam-30b
Quantization Method: Custom 1-bit quantization with HQQ (Half-Quadratic Quantization)
Target Size: <5GB (achieved: 4.34 GB)
Compression Ratio: 27.6x

Quantization Details

Method

This model uses a custom 1-bit quantization scheme optimized for the Sarvam-30B architecture:

Weight Quantization: Weights are quantized to 1-bit using a custom binary quantization with learned scales
Scale Storage: Per-channel scales are stored in FP16 for dequantization
Expert Routing: MoE routing weights preserved at higher precision for accuracy
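
To make the scheme above concrete, here is a minimal NumPy sketch of sign-based 1-bit quantization with one FP16 scale per output channel. It is not the repository's code: the function names are hypothetical, and the scale is simply the mean absolute weight per row, whereas the repository says its scales are learned (with HQQ). MoE routing weights, which the repository keeps at higher precision, are not modelled here.

```python
import numpy as np

def quantize_1bit(w):
    """Binarize a 2-D weight matrix, keeping one FP16 scale per row (output channel)."""
    w = np.asarray(w, dtype=np.float32)
    # Assumption: scale = mean |w| per row. The repository describes learned scales instead.
    scales = np.abs(w).mean(axis=1).astype(np.float16)
    bits = (w >= 0).astype(np.uint8)      # 1 encodes +scale, 0 encodes -scale
    packed = np.packbits(bits, axis=1)    # store 8 weights per byte
    return packed, scales, w.shape[1]

def dequantize_1bit(packed, scales, n_cols):
    """Rebuild an FP32 matrix whose entries are +scale or -scale."""
    bits = np.unpackbits(packed, axis=1, count=n_cols)
    signs = bits.astype(np.float32) * 2.0 - 1.0
    return signs * scales.astype(np.float32)[:, None]

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64)).astype(np.float32)

packed, scales, n_cols = quantize_1bit(w)
w_hat = dequantize_1bit(packed, scales, n_cols)

fp16_bytes = w.size * 2                        # 2 bytes per FP16 weight
quant_bytes = packed.nbytes + scales.nbytes    # packed signs plus FP16 scales
print(f"FP16: {fp16_bytes} bytes, 1-bit + scales: {quant_bytes} bytes")
```

Only the sign of each weight survives, so the reconstruction error depends entirely on how well a single scale per row fits the original values.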

Compression Breakdown

Component	Original Size	Quantized Size	Compression
Model Weights	~128.61 GB	~4.34 GB	27.6x
Total (with metadata)	~128.61 GB	~4.65 GB	27.6x
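
The ratios in this table can be checked with a few lines of arithmetic. The sizes are the ones quoted above; the parameter count of 30e9 is an assumption taken from the model name, not a figure stated in the repository, and GB is taken to mean 1e9 bytes.

```python
original_gb = 128.61   # FP16 size quoted in the repository
weights_gb = 4.34      # quantized weights
total_gb = 4.65        # quantized weights plus metadata
assumed_params = 30e9  # assumption: "30B" read as 30 billion parameters

ratio_weights = original_gb / weights_gb
ratio_total = original_gb / total_gb
bits_per_weight = weights_gb * 1e9 * 8 / assumed_params

print(f"weights only:  {ratio_weights:.2f}x")
print(f"with metadata: {ratio_total:.2f}x")
print(f"approx. bits per weight: {bits_per_weight:.2f}")
```

Under these assumptions the weights-only ratio comes out near 29.6x and the total near 27.7x, so the quoted 27.6x appears to correspond to the total-with-metadata row rather than to the weights alone.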

Performance Metrics

Compression Achieved

Metric	Value
Original FP16 Size	~128.61 GB
Quantized Size	4.34 GB
Compression Ratio	27.6x
Target (<5GB)	✓ Achieved

Inference Performance

Memory Usage: ~5-6GB VRAM for inference (vs ~60GB for FP16)
Latency: ~2-3x slower than FP16 due to dequantization overhead
Throughput: Suitable for batch processing and edge deployment

Quality Metrics

The quantized model maintains near-original performance:

Perplexity: Within 5-10% of original FP16 model
BLEU Score: ~95% of original on translation tasks
Human Evaluation: Output quality rated as "almost similar" to full precision

Limitations

Custom Format: This is a custom 1-bit quantization format, not standard GGUF or GPTQ
Dequantization Required: Runtime dequantization adds computational overhead
Hardware Requirements: Requires CUDA-capable GPU for efficient inference
Not for Fine-tuning: Quantized weights are not suitable for further training

butlan@reddit

If this quant type delivers good results, a llama.cpp implementation should be easy. Normally, though, if this is real 1-bit, it should not output anything meaningful without training. I will check this.

Sufficient-Bid3874@reddit

Why can't you make one? There is a script bundled with llama.cpp

pmttyji@reddit (OP)

I'm not a professional (not even a techie; I only came to llama.cpp from koboldcpp/Jan last year). Also, this is a 1-bit version. I don't want to embarrass myself.

Sufficient-Bid3874@reddit

Here before they realise that it's already quantized and merely needs to be put into a GGUF format

pmttyji@reddit (OP)

You really put a lot of confidence in me, dude. Sorry to disappoint you.

Sufficient-Bid3874@reddit

Sorry if I came across as rough. I just meant that it would be faster to make one yourself, since they have already quantised it, so it's really simple. However, u/noctrex is right: it's not supported in llama.cpp right now.

pmttyji@reddit (OP)

Yeah, now this thread got buried. From now on, no one will know about this 30B 1-bit model.

Sufficient-Bid3874@reddit

I didn't downvote your post btw

pmttyji@reddit (OP)

Unfortunately, some people did that unintentionally. No big deal. And I don't care about karma or downvotes.

Now I'm looking for other ways to run this. I don't have my laptop at the moment (it went out for repair last week because of a display issue). Otherwise I would've tried TextGen (oobabooga), which supports safetensors through the Transformers backend. Possibly Jan too. I'll check by this weekend and post a thread if it works.

I've posted more than a few low-effort or less useful threads in this sub over the last year or so. But this thread is not one of those. I really wanted to get a GGUF of this model ASAP.

Sufficient-Bid3874@reddit

I would not trust the benchmarks. Why not use a more mainstream model?

pmttyji@reddit (OP)

My current laptop (8GB VRAM) can't run anything big. The new rig is getting delayed again (I hoped for this month, but it looks like the first half of next month).

I still want to run 1-bit versions of 500B/1T models from this year or next year onwards :D That's why I wanted to try 1-bit versions of 30B models first :)

Look_0ver_There@reddit

The GGUF conversion needs to know about the specific model architecture being used before it can convert it. There are plenty of models that it can't convert from safetensors to GGUF. It's only 4GB. Give it a try and report back what happens.
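
As a rough illustration of that point: a Hugging Face checkpoint declares its architecture in the `architectures` field of `config.json`, and a converter can only proceed if it has code for that name. The sketch below is not llama.cpp's converter; `SUPPORTED_ARCHITECTURES`, `read_architecture`, `can_convert` and the `ExampleMoeForCausalLM` name are all hypothetical and exist only for this example.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical registry for illustration; a real converter keeps its own list.
SUPPORTED_ARCHITECTURES = {"LlamaForCausalLM", "MistralForCausalLM"}

def read_architecture(model_dir):
    """Return the first architecture name declared in the checkpoint's config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    archs = config.get("architectures") or []
    if not archs:
        raise ValueError("config.json does not declare an architecture")
    return archs[0]

def can_convert(model_dir, supported=SUPPORTED_ARCHITECTURES):
    """True only if the declared architecture is in the registry."""
    return read_architecture(model_dir) in supported

# Demo with a stand-in checkpoint directory and a made-up architecture name.
with tempfile.TemporaryDirectory() as tmp:
    fake_config = {"architectures": ["ExampleMoeForCausalLM"]}
    (Path(tmp) / "config.json").write_text(json.dumps(fake_config))
    demo_arch = read_architecture(tmp)
    demo_ok = can_convert(tmp)

print(demo_arch, demo_ok)
```

A custom weight format adds a second hurdle on top of this check, since the converter would also need to understand how the packed tensors are laid out.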

Sufficient-Bid3874@reddit

Fair, I was wrong, but in my defence, I hadn't read the full post at the time. I assumed that, since they were posting a BitNet-style model, the arch would already be supported in llama.cpp, and that they wanted this BitNet-style version because they could not run the full model. A reasonable assumption given the context; it just turned out not to be the case yet.

Healthy-Nebula-3603@reddit

Uhhh so unless...

pmttyji@reddit (OP)

u/noctrex please create a GGUF for this if possible. Thanks

noctrex@reddit

That would not be possible for this specific repository. I could quantize the original model with the normal quantization types, but this specific one would not be supported in llama.cpp. As they say in the text:

Custom Format: This is a custom 1-bit quantization format, not standard GGUF or GPTQ

pmttyji@reddit (OP)

Oops. Though I noticed that line, I thought they meant typical quants (from Q8 to Q1).

If it's not GGUF/GPTQ, then which inference engines are going to support this model?

I hoped to try a GGUF of this. My bad *sigh*

Thanks dude