How fast can I run models?
Posted by feelin-lonely-1254@reddit | LocalLLaMA
I'm running image processing with Gemma 3 27B and getting structured outputs as the response, but my current pipeline is awfully slow (I use Hugging Face for the most part, plus lm-format-enforcer): it processes a batch of 32 images in 5-10 minutes, with at most 256 tokens of response per image. This is running on 4x A100 40 GB GPUs.
This seems awfully slow and suboptimal. Can people share some code/notebooks and benchmark times for image processing? And should I switch to SGLang? I can't use the latest version of vLLM on my uni's compute cluster.
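For reference, here's a minimal sketch of what a vLLM-based version of this pipeline could look like, assuming a vLLM build with Gemma 3 multimodal support and guided decoding (which can replace lm-format-enforcer for structured output). The JSON schema, image paths, and prompt are placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams
from PIL import Image

# Hypothetical output schema; substitute the real structured-output schema.
schema = {
    "type": "object",
    "properties": {"caption": {"type": "string"}},
    "required": ["caption"],
}

# One engine spanning all 4 A100s via tensor parallelism.
llm = LLM(model="google/gemma-3-27b-it", tensor_parallel_size=4)

params = SamplingParams(
    max_tokens=256,
    guided_decoding=GuidedDecodingParams(json=schema),
)

# Prompt format depends on the model's chat template;
# <start_of_image> is Gemma 3's image placeholder token.
images = [Image.open(p) for p in ["img0.png", "img1.png"]]  # placeholder paths
requests = [
    {"prompt": "<start_of_image> Describe this image as JSON.",
     "multi_modal_data": {"image": img}}
    for img in images
]

# vLLM schedules these with continuous batching internally.
outputs = llm.generate(requests, params)
for out in outputs:
    print(out.outputs[0].text)
```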
Mr_Moonsilver@reddit
Support for batch-processing-capable engines like vLLM is spotty for Gemma 3. Is there a specific reason you need that particular model? If not, Mistral Small 3.1 24B is a good alternative, and there's an AWQ quant available. Using it should speed up your workflow considerably.
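If you go that route, loading an AWQ quant in vLLM is essentially a one-liner; the model ID below is illustrative, so check the HF hub for an actual quant:

```python
from vllm import LLM

# Illustrative repo name; substitute a real AWQ quant from the HF hub.
llm = LLM(
    model="some-org/Mistral-Small-3.1-24B-Instruct-AWQ",
    quantization="awq",
    tensor_parallel_size=4,
)
```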
PermanentLiminality@reddit
With 160 GB of VRAM you should be able to run several instances of Gemma 3 27B in parallel.
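A rough sketch of that data-parallel layout, assuming two vLLM instances that each span two GPUs (a 27B model in bf16 needs roughly 54 GB of weights, so a single 40 GB A100 would only fit a quantized copy); prompts and model ID are placeholders, and running vLLM inside worker processes like this is a sketch, not a hardened setup:

```python
import os
import multiprocessing as mp

def worker(gpu_ids: str, shard: list[str]) -> None:
    # Pin this worker to a subset of GPUs before vLLM/torch initialize CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_ids
    from vllm import LLM, SamplingParams  # import after setting the env var

    # Each instance spans 2 of the 4 A100s via tensor parallelism.
    llm = LLM(model="google/gemma-3-27b-it", tensor_parallel_size=2)
    params = SamplingParams(max_tokens=256)
    for out in llm.generate(shard, params):
        print(out.outputs[0].text)

if __name__ == "__main__":
    mp.set_start_method("spawn")
    prompts = [f"Describe item {i}." for i in range(64)]  # placeholder workload
    shards = [prompts[0::2], prompts[1::2]]
    procs = [
        mp.Process(target=worker, args=(gpus, shard))
        for gpus, shard in zip(["0,1", "2,3"], shards)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```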
feelin-lonely-1254@reddit (OP)
I can, but at present I'm batching 32 images at a time and that takes ~5 minutes to process; if I remember correctly, sequential processing lets me run 4 instances but still takes more time per image.
I've seen people claim that the latest vLLM can do 200 streams at 100 tok/s each on Gemma 27B, and I'm nowhere close to that performance... just wanted to know what people generally observe.
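To compare against claims like that, one quick way to measure your own aggregate throughput is to time a batch and count generated tokens. This sketch assumes the vLLM offline API, with a stand-in text prompt in place of the real image batch:

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-3-27b-it", tensor_parallel_size=4)
params = SamplingParams(max_tokens=256)
prompts = ["Describe a cat."] * 32  # stand-in for the 32-image batch

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count output tokens across all requests to get aggregate tok/s.
total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{total_tokens} tokens in {elapsed:.1f}s "
      f"-> {total_tokens / elapsed:.1f} tok/s aggregate")
```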