Stepfun 3.7 Flash is very good

[-]

-dysangel-@reddit (OP)

I haven't tried a proper agentic test yet. If I had to spitball it, I'd say it feels in the ballpark of 27B performance, but with much faster inference.

[-]

lolwutdo@reddit

As someone who used to use qwen 397b and 122b daily before switching to 3.6 27b, it’s gonna have to be orders of magnitude better than 27b to get me to switch back to a large model again.

27b is just an absolute beast; the only thing that might pull me away from it is a new 35b MoE or a new 27b version.

[-]

No_Mango7658@reddit

If you have the space for 397b, why be scared of Stepfun? It only has 11b active parameters, so it will be much faster than qwen3.6 27b.

[-]

lolwutdo@reddit

because prompt processing speeds are more important to me than TG for agentic work; fully offloaded 27b gives me 2k t/s prompt processing, 397b can't even get a fourth of that speed.
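
A rough sketch of that trade-off, assuming a turn costs prefill time plus decode time and ignoring prompt caching. The 2000 t/s prefill figure is the one quoted above and 500 t/s is the "a fourth of that" upper bound; the token counts and decode speeds are made-up placeholders, not measurements.

```python
def turn_seconds(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """Wall-clock seconds for one turn: prefill time plus decode time."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps

# Hypothetical agentic turn: a large context goes in, a short tool call comes out.
PROMPT_TOKENS = 40_000
GEN_TOKENS = 300

# Prefill speeds come from the comment above; decode speeds are placeholders.
dense_27b = turn_seconds(PROMPT_TOKENS, GEN_TOKENS, pp_tps=2000, tg_tps=30)
large_moe = turn_seconds(PROMPT_TOKENS, GEN_TOKENS, pp_tps=500, tg_tps=60)

print(f"27b: {dense_27b:.0f} s per turn, large MoE: {large_moe:.0f} s per turn")
```

With these placeholder numbers the large model decodes twice as fast and still takes almost three times as long per turn, because prefill dominates when the prompt is much longer than the reply.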

[-]

Such_Advantage_6949@reddit

The 27b with dflash or mtp is very fast also. I get on average 100 tok/s running 8bpw

[-]

soyalemujica@reddit

I gave it a try. 27b does a good job but struggles a bit with flying to the sides; I tested it at Q6 and Q5 as well. It does an even better job with the scenery and the amount of detail.

[-]

Wide_Big_6969@reddit

wow. How does it compare in inference speed to, say, dflash qwen 3.6 27b?

[-]

VampiroMedicado@reddit

I'm using it for an expenses registry now. I was tinkering last night (this morning) with how to do that.

Deepseek V4 Flash shit the bed, couldn't handle multi-tool calling.

Deepseek V3.2 surprisingly was alright, sometimes it shat the bed but overall "OK".

This one works flawlessly in all the tests I did.

[-]

sloptimizer@reddit

The high quality AesSedai Q5_K_M quant fits perfectly in 5 consumer GPUs with 32GB each.

Solid 36 tps with a mixed CUDA/ROCm setup, using the 5090 for attention and offloading MoE to the R9700s (seems to be the most efficient way to utilize most of the VRAM, avoiding context duplication across cards). Easily getting into 40 tps with ngram speculative decoding in an agentic setup.

Can't wait for the MTP patches to land!

cd ~/Env/repos/llama.cpp/
./build/bin/llama-server \
    --alias Step-3.7-Flash \
    --model /models/AesSedai/Step-3.7-Flash-GGUF/Q5_K_M/Step-3.7-Flash-Q5_K_M-00001-of-00004.gguf \
    --mmproj /models/AesSedai/Step-3.7-Flash-GGUF/Q5_K_M/mmproj-Step-3.7-Flash-BF16.gguf \
    --no-mmap \
    --temp 0.8 --top-k 0 --top-p 1.0 --min-p 0.05 \
    --repeat-penalty 1.04 --repeat-last-n 256 \
    --spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-draft-n-min 48 --spec-draft-n-max 64 \
    --ctx-size 201000 \
    -ctk f16 -ctv f16 \
    -fa on \
    -b 1024 -ub 1024 \
    -ngl 99 \
    --device CUDA0,ROCm0,ROCm1,ROCm2,ROCm3 \
    --tensor-split 1,0,0,0,0 \
    -ot "blk\.([0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([1-9][0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([0-4])\.ffn_.*=CUDA0" \
    -ot "blk\.([5-9])\.ffn_.*_exps.*=ROCm3" \
    -ot "blk\.(1[0-9])\.ffn_.*_exps.*=ROCm0" \
    -ot "blk\.(2[0-9])\.ffn_.*_exps.*=ROCm1" \
    -ot "blk\.(3[0-9])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.(4[0-4])\.ffn_.*_exps.*=ROCm3" \
    --parallel 1 \
    --threads 32 \
    --host 127.0.0.1
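
To sanity-check where the `-ot` patterns above send each layer, here is a small Python sketch that replays them. It assumes the first matching pattern wins, that anything unmatched lands on CUDA0 because of `--tensor-split 1,0,0,0,0`, and that the model has 45 blocks (0 to 44, inferred only from the highest index in the patterns). The tensor names used are illustrative guesses at GGUF-style naming, not a dump of the real file.

```python
import re
from collections import Counter

# Patterns copied from the -ot flags above, in the same order.
OVERRIDES = [
    (r"blk\.([0-9])\.attn_.*", "CUDA0"),
    (r"blk\.([1-9][0-9])\.attn_.*", "CUDA0"),
    (r"blk\.([0-4])\.ffn_.*", "CUDA0"),
    (r"blk\.([5-9])\.ffn_.*_exps.*", "ROCm3"),
    (r"blk\.(1[0-9])\.ffn_.*_exps.*", "ROCm0"),
    (r"blk\.(2[0-9])\.ffn_.*_exps.*", "ROCm1"),
    (r"blk\.(3[0-9])\.ffn_.*_exps.*", "ROCm2"),
    (r"blk\.(4[0-4])\.ffn_.*_exps.*", "ROCm3"),
]

def device_for(tensor_name, default="CUDA0"):
    """Return the device of the first pattern that matches (assumed semantics)."""
    for pattern, device in OVERRIDES:
        if re.search(pattern, tensor_name):
            return device
    return default  # assumption: the tensor split sends everything else to CUDA0

# Assumption: 45 blocks, one expert tensor name per block is enough to see the layout.
expert_layers = Counter(
    device_for(f"blk.{i}.ffn_gate_exps.weight") for i in range(45)
)

for device, count in sorted(expert_layers.items()):
    print(f"{device}: expert tensors from {count} layers")
```

Under those assumptions each ROCm card ends up with the experts of 10 layers (ROCm3 takes 5 to 9 plus 40 to 44), while CUDA0 keeps all attention tensors and the full FFN of layers 0 to 4.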

[-]

PigSlam@reddit

So how much RAM are we talking to fit this?

[-]

ProfessionalSpend589@reddit

Nice test. I was thinking of giving StepFun another chance, maybe I’ll download the new weights.

[-]

FoxiPanda@reddit

I've been messing around with the Q8 quant and I also agree that it is very good. One thing I've noted is how neat the code it writes is. It's all nicely sectioned, commented, formatted well, and is just generally pleasant. It's also pretty good in general conversation, too - it writes nicely formatted documentation and is generally pretty thorough.

[-]

LegacyRemaster@reddit

Qwen 3.7 max.... I hope 3.7 27b will be out sooooon

[-]

dzedaj@reddit

better than Qwen3.6-27B or not?

[-]

LegacyRemaster@reddit

It depends on what you need to do. I use qwen 3.6 27b with vscode+claude code, and with stepfun 3.7, unfortunately, the thinking cycles are too long. So it's true that I don't pay for tokens in terms of money, but I do pay for them in time wasted waiting for responses.

[-]

Tr4sHCr4fT@reddit

I am old enough to remember the Excel flight simulator

[-]

LegacyRemaster@reddit

me too

[-]

some_user_2021@reddit

The Hall of Tortured Souls

[-]

zR0B3ry2VAiH@reddit

Oh my god..... Completely forgot

[-]

PowerBottomBear92@reddit

Chocks Away! vibes

[-]

op8040@reddit

I was tempted to pull it last night but didn’t. Waiting on better quants/vllm for GB10.

[-]

Miserable-Dare5090@reddit

Look at the GB10 user forum. Eugr vllm already supports stepfun

[-]

-dysangel-@reddit (OP)

They have an IQ3_XXS that would fit on GB10, I want to try that on mine too

[-]

coder543@reddit

Have you tried out MiMo V2.5? It seems quite good.

[-]

-dysangel-@reddit (OP)

Yeah pretty sure I had it for a few minutes and it didn't feel any better than Minimax M2.7 for the RAM so I just deleted it

[-]

coder543@reddit

MiMo is one of the highest ranked and most token-efficient models that I see on benchmarks, and it does feel good. Unlike Minimax M2.7, MiMo V2.5 is also multimodal like Step 3.7 Flash.

I'm still waiting to see Step 3.7 Flash on the Artificial Analysis benchmarks. I've tried it out some locally, and it seems fine, but it hasn't blown me away.

[-]

-dysangel-@reddit (OP)

It's possible llama.cpp didn't have solid support for Mimo 2.5 yet, I should probably give it another go sometime.

Stepfun 3.7 still isn't GLM quality, but the balance of prefill/decode/RAM/understanding feels like the best I have so far.

[-]

LosEagle@reddit

Which quant did that? I'm on Q2_K_XL and it feels like I can't do any better if I wanna stay at good enough t/s with half-decent context and even this is pushing it.

[-]

FullstackSensei@reddit

Doubt you'd do well either if someone chopped out 3/4 of your brain.

Whatever your go-to model is, how much do you do unattended? Or do you find you have to babysit the model because it constantly makes mistakes or deviates from what you're asking it to do?

[-]

LosEagle@reddit

lmao was this necessary? Yea I get it, it's heavily quantized. I have low VRAM, no need to ridicule me for that lol

[-]

thefooz@reddit

No one was ridiculing you. They were stating a fact. You seem to have projected your own feelings about your setup onto the person who responded.

They posted their experience with Q4 and you’re saying Q2 isn’t giving you the same experience. It’s like hiring Albert Einstein’s non-brilliant brother and asking why they can’t come up with the theory of relativity. Just because they share a good chunk of their DNA, doesn’t mean they’re nearly as capable.

[-]

LosEagle@reddit

I wasn't talking about my experience with the Q2, because I haven't tried it on anything yet other than quick curl tests of how well it actually runs. I was wondering which quant OP used, because it seems to have done good work, and I wanted to compare which quants can still do a good enough job. I included mine as a reference for what I can run myself.

[-]

ambassadortim@reddit

It's ok. I understood what you were saying. Sometimes it's hard to understand text comments by others. I learned from the discussion, thanks.

[-]

thefooz@reddit

Fair enough, though they posted the quant in their original post.

[-]

FullstackSensei@reddit

I know you have low VRAM, hence my follow up question.

The point I'm trying to make is: stop chasing t/s and actually measure work done per unit of time. I find 3-4x lower t/s can often lead to something like 10x improvement in output, which in turn leads to getting quite a bit more done per unit of time, and at hugely lower stress levels.
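
One way to put numbers on "work done per unit of time" is the expected time until an attempt is actually accepted. This is only a sketch: the success rates, token counts, and review time below are invented for illustration, and it assumes every attempt is independent.

```python
def minutes_per_finished_task(tokens_per_attempt, tps, success_rate, review_minutes):
    """Expected wall-clock minutes until one attempt is accepted.

    Each attempt costs generation time plus human review time. With
    independent attempts, the expected number of attempts is 1 / success_rate.
    """
    attempt_minutes = tokens_per_attempt / tps / 60 + review_minutes
    return attempt_minutes / success_rate

# Hypothetical: a heavily quantized model at 40 t/s that succeeds 1 time in 5,
# versus a better quant at 10 t/s that succeeds 4 times in 5.
fast_small_quant = minutes_per_finished_task(6000, tps=40, success_rate=0.2, review_minutes=5)
slow_big_quant = minutes_per_finished_task(6000, tps=10, success_rate=0.8, review_minutes=5)

print(f"fast but flaky: {fast_small_quant:.1f} min per finished task")
print(f"slow but solid: {slow_big_quant:.1f} min per finished task")
```

With these made-up inputs the model that is 4x slower in t/s finishes a task in about half the time, because retries and review dominate the cost.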

[-]

LosEagle@reddit

Fair enough. That makes sense.

[-]

-dysangel-@reddit (OP)

That was Q4_K_S

[-]

Legitimate-Pumpkin@reddit

Give it water guns and a second player and take my money 💪

[-]

HavenTerminal_com@reddit

we really did go from hiding a flight sim in excel to just... this

[-]

technofox01@reddit

Did you make a game with this or is it flying the drone/airplane?

[-]

-dysangel-@reddit (OP)

I was controlling the plane. Since most models can handle tetris well these days, I've started asking them for things like a relaxing flight sim, or GTA-style 3D city etc. This was the most solid attempt at a flight sim that I've seen so far - really decent graphics and didn't require much feedback to get the controls sensible.

[-]

Panthau@reddit

You already mentioned the quant. Maybe you can give some more info, like system, prompt, t/s, how many prompts or how long you worked on it, how much bug fixing was needed, etc.

[-]

-dysangel-@reddit (OP)

This is running on my M3 Ultra. The generated world was like this first try, but I had to do some iteration on the keyboard controls so that it would pitch in local space instead of world space, and also mentioned the prop was facing the wrong way (which it fixed with no more details from me - though you can see the cockpit glass is still the wrong way). So about 30 seconds of input on my part, maybe 10 minutes for generation time?

I more just want to raise awareness of the model, since if I've ever tested a Stepfun model before, I didn't think it was worth the drive space. But this one seems pretty special.
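
For anyone wondering what "pitch in local space instead of world space" means in practice, here is a minimal sketch with plain 3x3 rotation matrices. It is not the code the model generated; the axis convention (X along the wing, Y up, nose along -Z) and the angles are assumptions chosen for the example.

```python
import math

def rot_x(a):
    """Rotation by angle a (radians) about the X axis."""
    c, s = math.cos(a), math.sin(a)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]

def rot_y(a):
    """Rotation by angle a (radians) about the Y axis."""
    c, s = math.cos(a), math.sin(a)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]

def matmul(a, b):
    """3x3 matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

yaw = rot_y(math.radians(90))    # the plane has already turned 90 degrees
pitch = rot_x(math.radians(30))  # the pilot pulls the nose up by 30 degrees
FORWARD = [0, 0, -1]             # assumed local nose direction

# Local space: the new rotation is applied first, in the plane's own frame.
local_fwd = apply(matmul(yaw, pitch), FORWARD)
# World space: the new rotation is applied last, about the fixed world X axis.
world_fwd = apply(matmul(pitch, yaw), FORWARD)

print("local pitch, nose height:", round(local_fwd[1], 3))
print("world pitch, nose height:", round(world_fwd[1], 3))
```

After the 90 degree turn, the local-space version raises the nose, while the world-space version leaves the nose level and rolls the plane instead, which is the kind of wrong-feeling control the feedback was about.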