Sarvam-30b-quantized - Need 1-bit version GGUF

Posted by pmttyji@reddit | LocalLLaMA | 17 comments

I randomly came across this 1-bit version of a 30B model. I remember that some of us wanted to see medium/large 1-bit models. Here's one. Could somebody please create a 1-bit GGUF version, so we can run something bigger with tiny/small VRAM? Thanks

Overview

This repository contains an ultra-quantized version of the Sarvam-30B model, achieving a 27.6x compression ratio from the original FP16 size (~128.61 GB) to approximately 4.34 GB.

Quantization Details

Method

This model uses a custom 1-bit quantization scheme optimized for the Sarvam-30B architecture:

  1. Weight Quantization: Weights are quantized to 1-bit using a custom binary quantization with learned scales
  2. Scale Storage: Per-channel scales are stored in FP16 for dequantization
  3. Expert Routing: MoE routing weights preserved at higher precision for accuracy
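
The three steps above can be sketched in a few lines of plain Python. This is only an illustration of sign-plus-scale binarization, not the repository's actual code: the card says its scales are learned, whereas this sketch uses the mean absolute value of each row as a stand-in, and the names `binarize_row` and `dequantize_row` are made up for the example.

```python
from typing import List, Tuple

Row = List[float]


def binarize_row(row: Row) -> Tuple[List[int], float]:
    """Quantize one output channel to +1/-1 signs plus a single scale.

    Assumption: the scale is the mean absolute value of the row, a common
    closed-form choice. The model card's scales are learned instead.
    """
    scale = sum(abs(w) for w in row) / len(row)
    signs = [1 if w >= 0 else -1 for w in row]
    return signs, scale


def dequantize_row(signs: List[int], scale: float) -> Row:
    """Rebuild an approximate row: every weight becomes +scale or -scale."""
    return [s * scale for s in signs]


# Two toy "channels"; real layers have thousands of weights per row.
weights = [[0.5, -0.25, 0.75, -1.0], [-0.1, 0.2, -0.3, 0.4]]
quantized = [binarize_row(r) for r in weights]
restored = [dequantize_row(signs, scale) for signs, scale in quantized]

for original, approx in zip(weights, restored):
    print(original, "->", approx)
```

Each row keeps only one sign per weight plus one scale, which is where the size reduction comes from. Keeping the MoE routing weights at higher precision (step 3) would simply mean skipping `binarize_row` for those tensors.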

Compression Breakdown

| Component | Original Size | Quantized Size | Compression |
|---|---|---|---|
| Model Weights | ~128.61 GB | ~4.34 GB | 27.6x |
| Total (with metadata) | ~128.61 GB | ~4.65 GB | 27.6x |
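
A quick sanity check of the quoted sizes, as a rough sketch. The sizes come straight from the card; the parameter count of 30 billion is an assumption taken from the "30B" in the model name, and 1 GB is treated as 10^9 bytes. With these numbers, the weights-only size works out to roughly 29.6x and the size with metadata to roughly 27.7x, so the quoted 27.6x seems to match the total rather than the weights alone.

```python
# Sizes quoted in the model card, in GB.
ORIGINAL_GB = 128.61
WEIGHTS_GB = 4.34
TOTAL_GB = 4.65

# Assumption: "30B" taken literally as 30 billion parameters.
ASSUMED_PARAMS = 30e9


def compression_ratio(original_gb: float, quantized_gb: float) -> float:
    """Return how many times smaller the quantized size is."""
    return original_gb / quantized_gb


def bits_per_weight(size_gb: float, n_params: float) -> float:
    """Average storage per parameter, treating 1 GB as 10**9 bytes."""
    return size_gb * 1e9 * 8 / n_params


ratio_weights = compression_ratio(ORIGINAL_GB, WEIGHTS_GB)
ratio_total = compression_ratio(ORIGINAL_GB, TOTAL_GB)
bpw = bits_per_weight(WEIGHTS_GB, ASSUMED_PARAMS)

print(f"weights only:  {ratio_weights:.1f}x")
print(f"with metadata: {ratio_total:.1f}x")
print(f"approx. bits per weight: {bpw:.2f}")
```

Under the 30B assumption the average lands slightly above 1 bit per weight, which would be consistent with FP16 scales and higher-precision routing weights being stored alongside the 1-bit values.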

Performance Metrics

Compression Achieved

| Metric | Value |
|---|---|
| Original FP16 Size | ~128.61 GB |
| Quantized Size | 4.34 GB |
| Compression Ratio | 27.6x |
| Target (<5GB) | ✓ Achieved |

Inference Performance

Quality Metrics

The quantized model maintains near-original performance:

Limitations

  1. Custom Format: This is a custom 1-bit quantization format, not standard GGUF or GPTQ
  2. Dequantization Required: Runtime dequantization adds computational overhead
  3. Hardware Requirements: Requires CUDA-capable GPU for efficient inference
  4. Not for Fine-tuning: Quantized weights are not suitable for further training
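
To give a feel for limitations 1 and 2, here is a minimal sketch of what packing 1-bit weights and dequantizing them at runtime might look like. The bit layout (8 signs per byte, least significant bit first) and the function names are assumptions made for illustration; the repository's custom format is not described in the card and may differ.

```python
from typing import List


def pack_signs(signs: List[int]) -> bytes:
    """Pack +1/-1 signs into bytes, 8 per byte, least significant bit first."""
    out = bytearray((len(signs) + 7) // 8)
    for i, s in enumerate(signs):
        if s > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def unpack_and_scale(packed: bytes, count: int, scale: float) -> List[float]:
    """Runtime dequantization: read each sign bit and apply the channel scale."""
    return [
        scale if (packed[i // 8] >> (i % 8)) & 1 else -scale
        for i in range(count)
    ]


signs = [1, -1, 1, 1, -1, -1, 1, -1, 1, 1]
packed = pack_signs(signs)          # 10 signs fit in 2 bytes
restored = unpack_and_scale(packed, len(signs), 0.5)

print(len(packed), "bytes for", len(signs), "weights")
print(restored)
```

The unpack step has to run before (or fused into) every matrix multiply, which is the overhead the card mentions. A GGUF conversion would need llama.cpp to understand whatever layout the repository really uses, or the weights would have to be re-quantized into a type it already supports.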