SGLang. Some problems, but significantly better performance compared to vLLM
Posted by Sadeghi85@reddit | LocalLLaMA | 3 comments
I wanted to serve gemma-3-12b-it on a single 3090, and I found the highest-quality quantized model to be this one: https://huggingface.co/abhishekchohan/gemma-3-12b-it-quantized-W4A16
The problem I had with vLLM was that 24 GB of VRAM wasn't enough for 32k context (fp8 KV-cache quantization didn't work), and token generation was half the speed of gemma-2, so I tried SGLang.
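For context, the failing vLLM setup was along these lines (an illustrative sketch, not the exact configuration; the argument values here are my own choices):

# Illustrative vLLM setup that runs out of memory at 32k context on a 24 GB card.
from vllm import LLM, SamplingParams

llm = LLM(
    model="abhishekchohan/gemma-3-12b-it-quantized-W4A16",
    max_model_len=32768,          # the 32k context that doesn't fit in 24 GB here
    kv_cache_dtype="fp8",         # fp8 KV-cache quantization (didn't work for me)
    gpu_memory_utilization=0.95,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)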
But SGLang gave some errors when trying to load the above model, so I had to patch these files:
gemma3_causal.py
if "language_model" in name and name not in params_dict.keys():
name = name.replace("language_model.", "")
if "multi_modal_projector" in name or "vision_tower" in name:
continue
compressed_tensors.py
# Guard the vLLM imports so compressed-tensors support degrades gracefully
# when vLLM is not installed.
from typing import Any

try:
    from vllm.model_executor.layers.quantization.base_config import QuantizeMethodBase
    from vllm.model_executor.layers.quantization.gptq import GPTQLinearMethod
    from vllm.model_executor.layers.quantization.gptq_marlin import (
        GPTQMarlinLinearMethod,
        GPTQMarlinMoEMethod,
    )
    from vllm.model_executor.layers.quantization.marlin import MarlinLinearMethod
    from vllm.model_executor.layers.quantization.utils.marlin_utils import (
        check_marlin_supported,
    )
    from vllm.scalar_type import scalar_types
    from vllm.model_executor.layers.quantization.compressed_tensors.schemes import (
        W4A16SPARSE24_SUPPORTED_BITS, WNA16_SUPPORTED_BITS, CompressedTensors24,
        CompressedTensorsScheme, CompressedTensorsW4A16Sparse24,
        CompressedTensorsW8A8Fp8, CompressedTensorsW8A8Int8,
        CompressedTensorsW8A16Fp8, CompressedTensorsWNA16)

    VLLM_AVAILABLE = True
except ImportError as ex:
    print(ex)
    VLLM_AVAILABLE = False
    GPTQLinearMethod = MarlinLinearMethod = QuantizeMethodBase = Any

    class scalar_types:
        uint4b8 = "uint4b8"
        uint8b128 = "uint8b128"
It's weird that the SGLang code feels incomplete. But I can now use 32k context with 24 GB of VRAM, KV-cache quantization works, and the speed difference is huge: 10 tok/s with vLLM compared to 46 tok/s with SGLang!
vLLM==0.8.2
SGLang==0.4.4.post3
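For reference, the SGLang launch was along these lines (an illustrative sketch using sglang's launch_server CLI flags; not the exact command I ran):

# Illustrative SGLang server launch with 32k context and fp8 KV cache.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "abhishekchohan/gemma-3-12b-it-quantized-W4A16",
    "--context-length", "32768",
    "--kv-cache-dtype", "fp8_e5m2",   # KV-cache quantization works here
    "--port", "30000",                # serves an OpenAI-compatible API
])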
One reason for the slow speed with vLLM could be that the latest version (0.8.2) can't work with the latest FlashInfer, because vLLM==0.8.2 requires torch==2.6 while FlashInfer requires torch==2.5.1.
To load the model above, SGLang needs vLLM to be installed (for compressed_tensors), but for the above reason (FlashInfer and torch versions), SGLang==0.4.4.post3 needs vLLM<=0.7.3.
This wasn't mentioned anywhere, so it was confusing at first.
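If you want to sanity-check which combination you ended up with, something like this prints the relevant versions (FlashInfer has been published under more than one package name, so both are checked):

# Print the installed versions involved in the torch/vLLM/FlashInfer conflict above.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("torch", "vllm", "sglang", "flashinfer", "flashinfer-python"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")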
I also tried online quantization of the base gemma-3-12b-it using a torchao config. It doesn't work with multimodal, so I changed config.json to be text-only. That works for low context, but with high context and KV-cache quantization the quality wasn't good. I also tried a GPTQ model, but it wasn't good either, presumably because it needs a high-quality calibration dataset. So it seems the best quantization for gemma-3 is llmcompressor with PTQ (no dataset), int4 W4A16.
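For anyone who wants to go that route, a data-free int4 W4A16 pass with llmcompressor looks roughly like the sketch below. I don't know the exact recipe the linked checkpoint used, so the scheme, targets and ignore list are illustrative, and since gemma-3 is multimodal you may again need the text-only config trick.

# Illustrative data-free (no calibration set) int4 W4A16 PTQ with llmcompressor.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_id = "google/gemma-3-12b-it"  # may need a text-only config, as noted above
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

recipe = QuantizationModifier(
    targets="Linear",      # quantize the Linear layers
    scheme="W4A16",        # int4 weights, 16-bit activations
    ignore=["lm_head"],    # keep the output head unquantized
)
oneshot(model=model, recipe=recipe)  # no dataset: plain round-to-nearest PTQ

model.save_pretrained("gemma-3-12b-it-W4A16", save_compressed=True)
tokenizer.save_pretrained("gemma-3-12b-it-W4A16")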
BABA_yaaGa@reddit
I want to run Qwen 2.5 VL 32B locally on a single 3090, using vLLM, with sufficient context length to enable video inference, but so far I haven't had any luck. Any help would be appreciated.
SouvikMandal@reddit
Try the awq model?
BABA_yaaGa@reddit
Using awq, still not able to run