Stepfun 3.7 Flash is very good
Posted by -dysangel-@reddit | LocalLLaMA
If you can fit Stepfun 3.7 Flash into RAM, try it! It's feeling close to GLM 5.1 quality in terms of aesthetics, and around 80% in terms of 3D world understanding.
However, since it's only 25% of the params of GLM 5.1, and it has built in, it feels like nothing else comes close for the RAM right now.
Kazushi998@reddit
how close to qwen 3.6 27b?
Such_Advantage_6949@reddit
is it better than qwen3.6 27B
-dysangel-@reddit (OP)
I haven't tried a proper agentic test yet. If I had to just spitball it, I'd say it feels in the ballpark of 27B performance, but with much faster inference.
lolwutdo@reddit
As someone who used to use qwen 397b and 122b daily before switching to 3.6 27b, it’s gonna have to be magnitudes better than 27b to get me to switch back to a large model again.
27b is just an absolute beast; the only thing that might switch me from it is a new 35b moe or 27b version.
No_Mango7658@reddit
If you have the space for 397b, why be scared of Stepfun? It's only 11b active parameters, so it will be much faster than qwen3.6 27b.
lolwutdo@reddit
because prompt processing speeds are more important to me than TG for agentic work; fully offloaded 27b gives me 2k t/s prompt processing, 397b can't even get a fourth of that speed.
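To put rough numbers on that trade-off: a minimal sketch, assuming one turn costs prompt tokens divided by prompt-processing speed plus output tokens divided by generation speed, with no prompt caching. The function name, the 40k/500-token turn, and both generation speeds are made-up illustrations; only the 2k t/s prefill figure and the "less than a fourth" ratio come from the comment above, so 500 t/s is an upper bound for the larger model.

```python
def turn_latency_s(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Seconds for one turn: prefill time plus generation time (no caching)."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


# Hypothetical agentic turn: 40k tokens of context in, 500 tokens of reply out.
# pp_tps values follow the comment above; tg_tps values are invented.
dense_27b = turn_latency_s(40_000, 500, pp_tps=2_000, tg_tps=30)
big_moe = turn_latency_s(40_000, 500, pp_tps=500, tg_tps=60)

print(f"27b:  {dense_27b:.1f} s per turn")
print(f"397b: {big_moe:.1f} s per turn")
```

Even with the larger model generating twice as fast in this made-up case, the long prompt makes prefill the bigger term, which is the point being argued.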
oxygen_addiction@reddit
5090RTX?
No_Mango7658@reddit
That's a good reason
Such_Advantage_6949@reddit
The 27b with dflash or mtp is very fast also. I get on average 100 tok/s running 8bpw
soyalemujica@reddit
I gave it a try with 27b. It does a good job but struggles a bit with flying to the sides. Tested at Q6 and Q5 as well; it does an even better job with the scenery and the amount of detail.
Wide_Big_6969@reddit
wow. How is it compared in inference speed to say dflash qwen 3.6 27b?
VampiroMedicado@reddit
I'm using it for an expenses registry now. I was tinkering last night (this morning) with how to do that.
Deepseek V4 Flash shit the bed, couldn't handle multi-tool calling.
Deepseek V3.2 surprisingly was alright, sometimes it shat the bed but overall "OK".
This one works flawlessly in all the tests I did.
sloptimizer@reddit
The high quality AesSedai Q5_K_M quant fits perfectly in 5 consumer GPUs with 32GB each.
Solid 36 tps with mixed CUDA/ROCm setup using 5090 for attention and offloading MoE to R9700 (seems to be most efficient way to utilize most of the VRAM, avoiding context duplication across cards). Easily getting into 40 tps with ngram speculative decoding in agentic setup.
Can't wait for the MTP patches to land!
PigSlam@reddit
So how much RAM are we talking to fit this?
ProfessionalSpend589@reddit
Nice test. I was thinking of giving StepFun another chance, maybe I’ll download the new weights.
FoxiPanda@reddit
I've been messing around with the Q8 quant and I also agree that it is very good. One thing I've noted is how neat the code it writes is. It's all nicely sectioned, commented, formatted well, and is just generally pleasant. It's also pretty good in general conversation, too - it writes nicely formatted documentation and is generally pretty thorough.
LegacyRemaster@reddit
Qwen 3.7 max.... I hope 3.7 27b will be out sooooon
dzedaj@reddit
better than Qwen3.6-27B or not?
LegacyRemaster@reddit
It depends on what you need to do. I use qwen 3.6 27b with vscode+claude code, and with stepfun 3.7, unfortunately, my thought cycles are too long. So, it's true that I don't pay for tokens in terms of money, but I do pay for them in terms of time and waste waiting for responses.
Tr4sHCr4fT@reddit
I am old enough to remember the Excel flight simulator
LegacyRemaster@reddit
me too
some_user_2021@reddit
The Hall of Tortured Souls
zR0B3ry2VAiH@reddit
Oh my god..... Completely forgot
PowerBottomBear92@reddit
Chocks Away! vibes
op8040@reddit
I was tempted to pull it last night but didn’t. Waiting on better quants/vllm for GB10.
Miserable-Dare5090@reddit
Look at the GB10 user forum. Eugr vllm already supports stepfun
-dysangel-@reddit (OP)
They have an IQ3_XXS that would fit on GB10, I want to try that on mine too
coder543@reddit
Have you tried out MiMo V2.5? It seems quite good.
-dysangel-@reddit (OP)
Yeah pretty sure I had it for a few minutes and it didn't feel any better than Minimax M2.7 for the RAM so I just deleted it
coder543@reddit
MiMo is one of the highest ranked and most token-efficient models that I see on benchmarks, and it does feel good. Unlike Minimax M2.7, MiMo V2.5 is also multimodal like Step 3.7 Flash.
I'm still waiting to see Step 3.7 Flash on the Artificial Analysis benchmarks. I've tried it out some locally, and it seems fine, but it hasn't blown me away.
-dysangel-@reddit (OP)
It's possible llama.cpp didn't have solid support for MiMo V2.5 yet; I should probably give it another go sometime.
Stepfun 3.7 still isn't GLM quality, but the balance of prefill/decode/RAM/understanding feels like the best I have so far.
LosEagle@reddit
Which quant did that? I'm on Q2_K_XL and it feels like I can't do any better if I wanna stay at good enough t/s with half-decent context and even this is pushing it.
FullstackSensei@reddit
Doubt you'd do well either if someone chopped off 3/4 of your brain.
Whatever your go to model is, how much do you do unattended? Or do you find you have to babysit the model because it constantly makes mistakes or deviates from what you're asking it to do?
LosEagle@reddit
lmao was this necessary? Yea I get it, it's heavily quantized. I have low VRAM, no need to ridicule me for that lol
thefooz@reddit
No one was ridiculing you. They were stating a fact. You seem to have projected your own feelings about your setup onto the person who responded.
They posted their experience with Q4 and you’re saying Q2 isn’t giving you the same experience. It’s like hiring Albert Einstein’s non-brilliant brother and asking why they can’t come up with the theory of relativity. Just because they share a good chunk of their DNA, doesn’t mean they’re nearly as capable.
LosEagle@reddit
I wasn't talking about my experience with Q2, because I haven't tried it on anything other than quick curl tests of how well it actually runs. I was wondering which quant OP used, because it seems to have done good work, and I wanted to compare which quants can still do a good enough job. I included mine as a reference for what I can run myself.
ambassadortim@reddit
It's ok. I understood what you were saying. Sometimes it's hard to understand text comments by others. I learned from the discussion, thanks.
thefooz@reddit
Fair enough, though they posted the quant in their original post.
FullstackSensei@reddit
I know you have low VRAM, hence my follow up question.
The point I'm trying to make is: stop chasing t/s and actually measure work done per unit of time. I find 3-4x lower t/s can often lead to something like 10x improvement in output, which in turn leads to getting quite a bit more done per unit of time, and at hugely lower stress levels.
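One way to make "work done per unit of time" concrete is to count finished tasks per hour instead of tokens per second. This is only a sketch: the function name and every number below are hypothetical, and it assumes each failed attempt costs a full generation plus a fixed human review time.

```python
def tasks_per_hour(tokens_per_attempt, tps, attempts_per_task, review_s):
    """Finished tasks per hour when every attempt is generated, then reviewed."""
    seconds_per_attempt = tokens_per_attempt / tps + review_s
    return 3600 / (seconds_per_attempt * attempts_per_task)


# Made-up comparison: a fast model that needs four tries per task versus
# a model three times slower that gets it right on the first try.
fast_but_sloppy = tasks_per_hour(3_000, tps=60, attempts_per_task=4, review_s=120)
slow_but_right = tasks_per_hour(3_000, tps=20, attempts_per_task=1, review_s=120)

print(f"fast model: {fast_but_sloppy:.1f} tasks/hour")
print(f"slow model: {slow_but_right:.1f} tasks/hour")
```

With these invented inputs the slower model finishes more tasks per hour, because the retries and the babysitting time outweigh the raw t/s advantage.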
LosEagle@reddit
Fair enough. That makes sense.
-dysangel-@reddit (OP)
That was Q4_K_S
Legitimate-Pumpkin@reddit
Give it water guns and a second player and take my money 💪
HavenTerminal_com@reddit
we really did go from hiding a flight sim in excel to just... this
technofox01@reddit
Did you make a game with this or is it flying the drone/airplane?
-dysangel-@reddit (OP)
I was controlling the plane. Since most models can handle tetris well these days, I've started asking them for things like a relaxing flight sim, or GTA-style 3D city etc. This was the most solid attempt at a flight sim that I've seen so far - really decent graphics and didn't require much feedback to get the controls sensible.
Panthau@reddit
You already mentioned the quant, maybe you can give some more info, like system, prompt, t/s, how many prompts or how long you worked on it, how much bug fixing was needed, etc.
-dysangel-@reddit (OP)
This is running on my M3 Ultra. The generated world was like this first try, but I had to do some iteration on the keyboard controls so that it would pitch in local space instead of world space, and also mentioned the prop was facing the wrong way (which it fixed with no more details from me - though you can see the cockpit glass is still the wrong way). So about 30 seconds of input on my part, maybe 10 minutes for generation time?
I more just want to raise awareness of the model, since if I've ever tested a Stepfun model before, I didn't think it was worth the drive space. But this one seems pretty special.
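For anyone wondering what the local-space versus world-space pitch fix looks like in practice, here is a minimal sketch in plain Python. It is not the generated game's code. It assumes a right-handed, Y-up frame with the nose along local -Z, and all function names are hypothetical. The only difference between the two versions is which side the pitch rotation is multiplied on.

```python
import math


def rot_x(a):
    """Rotation about the X axis (pitch) by angle a in radians."""
    c, s = math.cos(a), math.sin(a)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]


def rot_y(a):
    """Rotation about the Y axis (yaw) by angle a in radians."""
    c, s = math.cos(a), math.sin(a)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]


def matmul(a, b):
    """3x3 matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def nose(orientation):
    """World-space direction of the nose (local -Z) for a body-to-world matrix."""
    return [-orientation[i][2] for i in range(3)]


def pitch_world(orientation, angle):
    """Pitch about the fixed world X axis: rotation applied on the left."""
    return matmul(rot_x(angle), orientation)


def pitch_local(orientation, angle):
    """Pitch about the plane's own X axis (wing to wing): applied on the right."""
    return matmul(orientation, rot_x(angle))


# Plane has yawed 90 degrees, then the player pulls back 30 degrees.
heading = rot_y(math.radians(90))
pull = math.radians(30)

local_nose = nose(pitch_local(heading, pull))
world_nose = nose(pitch_world(heading, pull))
print("local pitch, nose height:", round(local_nose[1], 3))
print("world pitch, nose height:", round(world_nose[1], 3))
```

In this setup the local-space version lifts the nose, while the world-space version leaves the nose level and rolls the plane instead, which is why controls only feel right until the first turn.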