stepfun-ai/Step3-VL-10B · Hugging Face
Posted by TKGaming_11@reddit | LocalLLaMA | View on Reddit | 29 comments
Chromix_@reddit
That's quite a step up compared to the larger models. Unfortunately there's no llama.cpp support yet, but given the model size it should run somewhat OK as-is with transformers on a 24 GB VRAM GPU.
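Not from the thread, but a minimal sketch of what "running it as-is with transformers" could look like. The exact classes depend on the model's remote code, so AutoModelForCausalLM/AutoProcessor with trust_remote_code is an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "stepfun-ai/Step3-VL-10B"

# Processor/model classes are assumptions; the repo's remote code may expose custom ones.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~20 GB of weights for a 10B model in bf16
    device_map="cuda",           # keep everything on the single 24 GB GPU
    trust_remote_code=True,
)
```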
bronkape_@reddit
We're very keen on adding llama.cpp support, but our small team is currently at full capacity. We're aiming for early February. We highly encourage community contributions and would love to collaborate with anyone interested in leading this effort!
beneath_steel_sky@reddit
Merged and available in the latest binary https://github.com/ggml-org/llama.cpp/pull/21287
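Also not from the thread: once the architecture is in llama.cpp, a GGUF of the model can in principle be driven from Python as well. A rough sketch, assuming llama-cpp-python has picked up the new architecture and using a hypothetical quant filename; the vision path additionally needs the mmproj projector file, which is omitted here:

```python
from llama_cpp import Llama

# Hypothetical GGUF filename; any quant of the model would be loaded the same way.
llm = Llama(
    model_path="Step3-VL-10B-Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=8192,
)

out = llm("Summarize what Step3-VL-10B is in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```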
Jazzlike-Result-2330@reddit
They've already created a GGUF file that can be used in LM Studio.
ZealousidealBadger47@reddit
Any link? I tried https://huggingface.co/seanbailey518/Step3-VL-10B-GGUF, but it's not working in LM Studio.
McVitas@reddit
Is there a quantized version of this one?
LegacyRemaster@reddit
Tested on an RTX 6000 96 GB. Very, very, very slow.
10 tokens/sec. Not bad for an 8k video card!
C:\llm>python teststep.py
CUDA available: True
GPU name: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
Total GPU memory: 95.59 GB
Torchvision version: 0.25.0.dev20260115+cu128
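The teststep.py script itself isn't shown; a minimal sketch of a diagnostic that would print output like the above (the torchvision import is inferred from the last line):

```python
import torch
import torchvision

print("CUDA available:", torch.cuda.is_available())
print("GPU name:", torch.cuda.get_device_name(0))

props = torch.cuda.get_device_properties(0)
print(f"Total GPU memory: {props.total_memory / 1024**3:.2f} GB")
print("Torchvision version:", torchvision.__version__)
```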
vidibuzz@reddit
Something looks very fishy there. Not worth installing if performance is that bad.
AvocadoArray@reddit
There’s no way, those are CPU numbers for a 10B model. Or is there something about this model architecture that makes inference slow?
LegacyRemaster@reddit
100% GPU
Loskas2025@reddit
I read "GPU memory used"...
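Not from the thread, but the straightforward way to settle whether the weights (and not just the KV cache) are resident on the GPU is to inspect where the loaded model's parameters actually live. A sketch assuming the transformers setup from earlier:

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical reload, as in the earlier sketch.
model = AutoModelForCausalLM.from_pretrained(
    "stepfun-ai/Step3-VL-10B",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)

# Should print only cuda devices; any 'cpu' entry means layers were offloaded.
print({p.device for p in model.parameters()})
print(getattr(model, "hf_device_map", None))
```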
RnRau@reddit
What inference engines support this one?
bronkape_@reddit
vllm https://github.com/vllm-project/vllm/pull/32329
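A minimal offline-inference sketch for once that PR lands in a vLLM release; the chat template and image handling for this particular model are assumptions, so this only shows the text path:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="stepfun-ai/Step3-VL-10B", trust_remote_code=True)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Describe the Step3-VL-10B benchmark results."], params)
print(outputs[0].outputs[0].text)
```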
FullOf_Bad_Ideas@reddit
One of the first VLMs, if not the first, to use Meta's Perception Encoder (PE) as its vision encoder.
__Maximum__@reddit
So the catch is more inference time and VRAM for context? It's actually not a bad trade-off if it scales. There are many problems for which I am willing to wait if the quality of the answer is better.
SlowFail2433@reddit
Yes test-time compute is usually a fairly decent trade-off TBH
SlowFail2433@reddit
Parallel Coordinated Reasoning (PaCoRe) is the main novelty, I think. It also uses Meta's Perception Encoder, which is strong.
Alpacaaea@reddit
Is it really that hard to make a not horrible graph?
kaisurniwurer@reddit
Seeing as your post is "controversial" I assume there is a lot of personal preference in play here.
I like this one, to me it's more readable than colors while highlighting the model in question.
Top_Necessary7623@reddit
vllm
TheRealMasonMac@reddit
This actually looks like a good graph though. It doesn't distort the relative difference and it's easy to tell which model is which.
foldl-li@reddit
This is terrible. It drove me crazy when reading it. I don't know why, but my brain just struggled to extract any information from it.
Alpacaaea@reddit
I meant more that the other models are all grey
silenceimpaired@reddit
Grey with patterns… at a glance you can see how this model compares against all the other models, and with a closer look you can compare it against a specific one. Sure, they could have added more colors, but then you'd have to hunt and peck for the model being compared, and it would look a little garish.
Alpacaaea@reddit
I'd rather it be easy to read and accurate than look nice. More colors would make it easier to see which line is which model.
silenceimpaired@reddit
A fair counterpoint. :)
And1mon@reddit
Looks promising, but I bet real-life performance looks very different. Has anyone tried it yet?
lisploli@reddit
Wow, step bro, your vertical bar is huge!
Takashi728@reddit
Bars