SGLang. Some problems, but significantly better performance compared to vLLM

Posted by Sadeghi85@reddit | LocalLLaMA

I wanted to serve gemma-3-12b-it on a single 3090, and the highest-quality quantized model I found was this one: https://huggingface.co/abhishekchohan/gemma-3-12b-it-quantized-W4A16


The problem I had with vLLM was that 24GB of VRAM wasn't enough for 32k context (fp8 KV cache quantization didn't work), and token generation was half the speed of gemma-2, so I tried SGLang.
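For reference, the vLLM setup I'm describing looks roughly like this with the offline LLM API (the same max_model_len / kv_cache_dtype options exist as server flags); the exact values here are illustrative, not a definitive config:

from vllm import LLM, SamplingParams

# Roughly the configuration I was aiming for on a 24GB 3090 with vLLM 0.8.2.
llm = LLM(
    model="abhishekchohan/gemma-3-12b-it-quantized-W4A16",
    max_model_len=32768,
    kv_cache_dtype="fp8",          # the fp8 KV cache quantization that didn't work for me
    gpu_memory_utilization=0.95,
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)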


But SGLang gave some errors when trying to load the above model, so I had to patch these two files:

gemma3_causal.py

# Inside the weight-loading loop: drop the "language_model." prefix so the
# checkpoint's names match this text-only model's params_dict, and skip the
# vision tower / multi-modal projector weights entirely.
if "language_model" in name and name not in params_dict.keys():
    name = name.replace("language_model.", "")
if "multi_modal_projector" in name or "vision_tower" in name:
    continue
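For context, these lines go inside the load_weights() loop of that file; the surrounding code is roughly shaped like this (a simplified sketch, exact details differ between SGLang versions):

# Simplified sketch of the weight-loading loop in gemma3_causal.py (not the exact code).
def load_weights(self, weights):
    params_dict = dict(self.named_parameters())
    for name, loaded_weight in weights:
        # >>> the two fixes shown above go right here, before the lookup <<<
        param = params_dict[name]
        weight_loader = getattr(param, "weight_loader", None)
        if weight_loader is not None:
            weight_loader(param, loaded_weight)   # sharded / stacked params
        else:
            param.data.copy_(loaded_weight)       # plain copy for everything else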


compressed_tensors.py

from typing import Any  # needed for the fallback placeholders below

try:
    from vllm.model_executor.layers.quantization.base_config import QuantizeMethodBase
    from vllm.model_executor.layers.quantization.gptq import GPTQLinearMethod
    from vllm.model_executor.layers.quantization.gptq_marlin import (
        GPTQMarlinLinearMethod,
        GPTQMarlinMoEMethod,
    )
    from vllm.model_executor.layers.quantization.marlin import MarlinLinearMethod
    from vllm.model_executor.layers.quantization.utils.marlin_utils import (
        check_marlin_supported,
    )
    from vllm.scalar_type import scalar_types
    from vllm.model_executor.layers.quantization.compressed_tensors.schemes import (
        W4A16SPARSE24_SUPPORTED_BITS,
        WNA16_SUPPORTED_BITS,
        CompressedTensors24,
        CompressedTensorsScheme,
        CompressedTensorsW4A16Sparse24,
        CompressedTensorsW8A8Fp8,
        CompressedTensorsW8A8Int8,
        CompressedTensorsW8A16Fp8,
        CompressedTensorsWNA16,
    )

    VLLM_AVAILABLE = True
except ImportError as ex:
    print(ex)

    VLLM_AVAILABLE = False

    # Fallback placeholders so the module still imports when vLLM isn't installed.
    GPTQLinearMethod = MarlinLinearMethod = QuantizeMethodBase = Any

    class scalar_types:
        uint4b8 = "uint4b8"
        uint8b128 = "uint8b128"


It's weird that the SGLang code feels incomplete. But I can now use 32k context with 24GB of VRAM, KV cache quantization works, and the speed difference is huge: 10 tps with vLLM versus 46 tps with SGLang!
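The working setup looks roughly like this with SGLang's offline Engine API (the sglang.launch_server CLI exposes the same options as flags); treat the values and the KV cache dtype string as illustrative:

import sglang as sgl

# Roughly the working configuration on a single 3090 (24GB) with SGLang 0.4.4.post3.
llm = sgl.Engine(
    model_path="abhishekchohan/gemma-3-12b-it-quantized-W4A16",
    context_length=32768,
    kv_cache_dtype="fp8_e5m2",   # quantized KV cache so 32k context fits in 24GB
    mem_fraction_static=0.9,
)

prompts = ["Hello, my name is"]
outputs = llm.generate(prompts, {"temperature": 0.7, "max_new_tokens": 64})
print(outputs[0]["text"])
llm.shutdown()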


vLLM==0.8.2

SGLang==0.4.4.post3


One reason for the slow speed with vLLM could be that the latest version (0.8.2) can't work with the latest FlashInfer, because vLLM==0.8.2 requires torch==2.6 while FlashInfer requires torch==2.5.1.


To load the model above, SGLang needs vLLM to be installed (for compressed_tensors), but because of the FlashInfer/torch version conflict above, SGLang==0.4.4.post3 needs vLLM<=0.7.3.


This wasn't mentioned anywhere, so it was confusing at first.


I also tried online quantization of the base gemma-3-12b-it using a torchao config. It doesn't work with multimodal, so I changed config.json to be text-only. That works at low context, but with high context and KV cache quantization the quality wasn't good. I also tried a GPTQ model, but it wasn't good either, presumably because it needs a high-quality calibration dataset. So it seems the best quantization for gemma-3 is llmcompressor using PTQ (no dataset) to int4 W4A16.
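For anyone who wants to try that route, a data-free W4A16 pass with llmcompressor looks roughly like this. The imports and arguments are approximate and change between llm-compressor releases, and since gemma-3 is multimodal you may still need the text-only config.json trick (or add the vision modules to ignore):

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "google/gemma-3-12b-it"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Data-free PTQ: int4 weight-only (W4A16) on all Linear layers, keeping lm_head in full precision.
recipe = QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

model.save_pretrained("gemma-3-12b-it-W4A16", save_compressed=True)
tokenizer.save_pretrained("gemma-3-12b-it-W4A16")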