StepFun 3.7 Flash
Posted by Everlier@reddit | LocalLLaMA | View on Reddit | 109 comments
StepFun dropped Step 3.7 Flash, 196B total / 11B active MoE, runs locally on 128GB RAM
It's a multimodal MoE (196B total params, only 11B active) with a built-in 1.8B ViT for vision.
Benchmark highlights vs. other flash-tier models:
- SWE-Bench Pro: 56.26% (beats DeepSeek V4 Flash at 55.6%, matches Gemini 3.5 Flash at 55.1%)
- DeepSearchQA F1: 92.82%, competitive with GPT 5.5 (93.98%)
- HLE w/ tools: 47.2%, solid for a flash-class model
Essentially punches well above its active parameter weight on agentic and coding tasks. If you've got the RAM for it, looks like a genuinely interesting local option, especially for agent workflows.
Available on OpenRouter and NVIDIA NIM if you don't want to self-host.
ortegaalfredo@reddit
This is a very strange model. Its thinking process is basically incomprehensible. It writes like a lunatic with autism. But then it stops and produces a perfect answer better than models >1TB in size. Apparently they fixed the 'infinite thinking' bug that 3.5 had, and now its quite usable.
This might be it, if you have 4x3090s or better.
digitalfreshair@reddit
Better than minimax m2.7 for ds4 flash for coding?
ortegaalfredo@reddit
Honestly, not. I'm testing it with the nigthly version of vllm so it alway has bugs. It seems very good at coding but it makes many mistakes.
jdchmiel@reddit
Hmm how did you eval minimax 2.7 vs stepfun 3.5 to determine that?
ortegaalfredo@reddit
Super easy, just tell them to write a small game. Minimax usually does in one-shot. Stepfun does a great job too, but it makes mistakes. But this is on my local rig, with an experimental vllm nigthly version. Might improve in the future, as many models do.
llama-impersonator@reddit
so this version no longer yaps for 8k tokens before answer?
ortegaalfredo@reddit
Tested it on their website and it's very reasonable now.
nmkd@reddit
Wonder what's the reason for reasoning being reasonable now
EndlessZone123@reddit
Is there any reason why thinking tokens need to be reasonable?
Could train a modem to think in Shakespeare and output normals results right?
Zeeplankton@reddit
I think there was a recent study about training models to just not think in language at all but latent space https://arxiv.org/abs/2412.06769
uhuge@reddit
sucks for debugging the RL training, IMO
ortegaalfredo@reddit
I would expect that you cannot extract a good output from garbage tokens. Thinking tokens are, after all, just adjusting the output for a more specific prediction.
-dysangel-@reddit
you're not extracting the tokens, you're caching the thinking state during each token in the KV cache
ImpressiveSuperfluit@reddit
Technically you can get anything from anything, that's what language is to begin with. There is no universal law saying that I now have to write wobblywubbledidat, I was simply trained for it, if you will. In principle you could make a thinking step thing in various shadings of pink elephant descriptions and it's whatever, so long as those things still carry meaningful vectors. Thinking in, more or less, normal plain text is mostly just convenient, but technically you can do whatever you want.
evia89@reddit
https://old.reddit.com/r/ClaudeAI/comments/1tqd246/opus_48_in_caveman_talking_about_the_difference/
or https://arxiv.org/abs/2502.18600
EbbNorth7735@reddit
The reasoning is trained through RL. Many simulations to hone the skill. It is what it is and can be anything it's trained to do.
FuckNinjas@reddit
That's a good idea. Shall we proceed?
uhuge@reddit
DistanceSolar1449@reddit
It's better than gpt-oss reasoning at least lol
Gullible_Drummer_246@reddit
This is hilarious
comperr@reddit
What about 2x5090s? And 96gb ram. The other option i have is 5090+3090 and 128gb ram.
pigeon57434@reddit
sounds like the sign of a smart model to me thats what o3 did basically
FoxiPanda@reddit
From their HF model page in case you need direct links.
annodomini@reddit
Step 3.5 Flash ranked below Gemma 4 31B on AA. Here's to hoping that 3.7 does a bit better, though the 3 bit quant may not be too friendly to it. Anyhow, downloading now, would like to compare to MiniMax, Qwen 3.5 122b, and the others.
annodomini@reddit
OK, have downloaded and Step 3.7 at a 3 bit quant isn't bad so far. Only a few quick tests, nothing comprehensive, but definitely something worthy of trying out if you can fit it.
llama-impersonator@reddit
AA is a pretty crap bench
annodomini@reddit
Got any better ones that are as comprehensive in models tested?
j_osb@reddit
the problem with AA is that it mainly 'benchmarks' agentic performance.
The benchmarks are oversaturated, and the actual benchmark selection is not a good representation of even that performance.
It for example rates many very small models as better than Deepseek R1.
Yes, Deepseek R1 did not support tool calls and that is why it's ranked so low, but at the time of that R1 was still the smartest DS model for a lot of things.
The only good way to test models is to try them on your usecase and evaluate them there.
annodomini@reddit
I mean, agentic performance is pretty important to me. I need reliable tool calling, so having that included in the bench is helpful. A smart model that doesn't have good tool calling wouldn't be that useful for my use cases.
I'm aware that AA benches aren't perfect, and some models will benchmaxx them; but I haven't found anything better for getting a rough sense of how a model performs, and has such a comprehensive and up to date set of data, so I can use it for comparing most recent models.
I use it to get a rough sense of which models will be worth the download and testing time locally. Of course you need to test against your own representative tasks. But it can be helpful to have a starting point, there are a lot of models out there and I don't have the time to test them all.
llama-impersonator@reddit
lol, absolutely not. AA are the hype winners.
the benchmarks i liked kinda disappeared. fiction.live's context bench was adversarial enough to show actual differences in ability to handle context, but it stopped getting updated and might've saturated after the agentic RLpocalypse was upon us. dubesor archived his bench, ooba went private, etc.
ortegaalfredo@reddit
Gemma-4-31B was not better than Qwen-3.6-27B in any way.
I'm seeing Step-3.7-Flash is equal or better than 27B at most things I throw at it.
ai-infos@reddit
don't get why you're being downvoted, as i get the same feeling
z_3454_pfk@reddit
Step 3.5 flash could produce novel outputs when compared to even opus and gpt 5.4. so yeah it’s really good.
EbbNorth7735@reddit
Would love the communities input on how to maximize NVFP4 on a Blackwell RTX 6000 with vision and MTP if that's a thing, if not all good. Also would love to see benchmark comparisons between Qwen 3.6 27B and this model. Will hunt those down next. More or less just typing out loud.
FoxiPanda@reddit
You'll need 2x RTX Pro 6000 Blackwells to load this up - it's 124GB in NVFP4 mode, but the good news is that you'd be able to pretty easily get model + mmproj + full context window at BF16 / Q8 KV cache and still have VRAM left over I suspect.
quantier@reddit
Yeah this is what I am anticipating as well. I didn’t see any reference to MTP, do we have MTP support?
This model is looking very interesting 😃
I’ll get to testing soon. I hope to get about 24-32 num seqs at 256K Kv Cache
beneath_steel_sky@reddit
/u/ilintar is working on it and trying to fast-track mtp support https://github.com/ggml-org/llama.cpp/pull/23274#issuecomment-4573905564
jld1532@reddit
Here's hoping the unsloth quants are a bit smaller.
MotokoAGI@reddit
The unsloth quants will be bigger with the dynamic quants.
FoxiPanda@reddit
It's a ~200B model, so there's no way around it being pretty big. Well suited for >128GB Mac Studios / DGX Spark / 2x RTX 6000 Pro users though.
jld1532@reddit
I hear ya but unsloth's minimax iq4_xs is only 3gb bigger than this gguf and that is a 230B A10B model. I'm not saying gguf size is always linear but I am becoming an unsloth believer.
FoxiPanda@reddit
Yeah I agree, you'll probably find something that you can run at a slightly smaller size, but it might vary a bit from minimax due to architectural differences. My bet would be on something like IQ3_XS or some such from Unsloth to fit in 96GB of VRAM with some sort of decent context window (probably not full though).
Dazzling_Equipment_9@reddit
This is fantastic! Version 3.5 was already amazing, and with the addition of multimodal capabilities, it should be perfect for Strixhalo!
my_name_isnt_clever@reddit
What quant do you run it at on Strix Halo? I run ~120b regularly but ~200b is more of a challenge. I imagine I'll have to run it headless to have enough free memory.
Nybio@reddit
Managed to run it on 4070 + 96 VRAM, got about 15 tokens/s. So far hard to tell how much it is better over qwen 3.6 35b and gemma 4 26b
suicidaleggroll@reddit
I wasn’t impressed with 3.5. The code it generated was just average, and it was awful with tool calls, making stupid mistakes like launching a docker container in the foreground and locking itself up, inability to write certain format files, etc.
Because of Step’s overthinking, it took twice as long to get a result that was half as good as MiniMax, assuming it was able to finish at all (see above issue with it locking itself up). Hopefully they’ve fixed some of these issues in 3.7, but I’m not going to hold my breath that this is some “1T killer” like the bots were claiming about 3.5.
sixx7@reddit
Completely agree. I think there's a reason you never hear anyone talk about the Step series of models after release. Might be the worst series of models of all the labs. Tool calling (thus agentic use) is just absolute garbage.
my_name_isnt_clever@reddit
The 3.5 release was discussed plently when it was new, it's just a big ass model for local so it's more niche than something like Qwen 27b. I have a Strix Halo 128b and people in the 128GB unified memory club were raving about 3.5.
mr_zerolith@reddit
I can't get any model in the \~200b range to generate such good code, but i'm using CLine, not agents.
I hear it's weak in agentic but we've had good luck with opencode/claude code regardless.
What quant are you running and on what?
And when did you evaluate it? 3.5 got much faster a month ago on long context.
I have a 5090 + RTX PRO 6000 here with some OC running Q4_K_M
MDSExpro@reddit
Because you need 8bit quants. Had same issue for ages, code suddenly got better once I have upgraded from Qwen3.5-122B int4 to int8.
suicidaleggroll@reddit
I last tested it about 3 months ago. I was using their own Q4_K_S quant on 2x RTX Pro 6000s.
spaceman_@reddit
StepFun also dropped a PR to llama.cpp: github.com/ggml-org/llama.cpp/pull/23845
my_name_isnt_clever@reddit
Just clicking the GGUF link and seeing it was uploaded by StepFun themselves is a good sign. They seem to give slightly more of a shit about us than the other labs.
jacek2023@reddit
I am able to run Q3 locally at a good speed, and 3.7 seems censored, while 3.5 looks uncensored
JaredsBored@reddit
The 3.5 is the biggest model I can run on my hardware, and it's very useful for whenever I need a model with the most world knowledge as possible. Definitely will give this a download and try
jdchmiel@reddit
Same. The contest for me will be minimax 2.7 vs stepfun 3.7 in 4 bit as my 'big' local model, usually with only the experts in vram.
Septerium@reddit
REALLY nice
Steuern_Runter@reddit
Benchmark results are looking good, I hope it still holds up well after quantization.
Jealous-Astronaut457@reddit
MTP ?
rpkarma@reddit
Yes
Adventurous-Okra-407@reddit
3.5 was very underrated so this makes me happy to see. Gonna spend some time testing it out.
a_beautiful_rhind@reddit
It's like 400 prompt and 35t/s for me with the old one at Q4_K_L. did surprisingly well for the active params.
mr_zerolith@reddit
on what hardware?
a_beautiful_rhind@reddit
4x3090, QQ89 proc
LeatherRub7248@reddit
what hardware was that on?
Due_Net_3342@reddit
we need mtp support please
theologi@reddit
!RemindMe 15 days
RemindMeBot@reddit
I will be messaging you in 15 days on 2026-06-13 10:13:23 UTC to remind you of this link
CLICK THIS LINK to send a PM to also be reminded and to reduce spam.
^(Parent commenter can ) ^(delete this message to hide from others.)
nuclearbananana@reddit
No news on the mysterious step 3.6 on nanogpt
nullmove@reddit
StepFun ran an insider access program for 3.6 and I guess once it concluded they just fixed user feedbacks and called the final model 3.7.
Nanogpt probably got access to it, but because it was unreleased I am not sure they had permissions to proxy/redistribute access to it.
bambamlol@reddit
lol wtf? I just realized there's literally no mention of this on their site. What model was I using?!
Was this maybe an unofficial beta test? Otherwise, why release 3.7 after 3.5?
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
Main_Problem_2696@reddit
196B total, 11B active MoE. Runs on 128GB RAM. SWE-Bench 56.26% beats DeepSeek V4 Flash. Solid for local agent workflows. Used Runable to build a local LLM comparison dashboard. Dropped in benchmark numbers, got clean charts in an afternoon. Made the eval way easier.
NickCanCode@reddit
Damn, I just added value to my DeepSeek API account yesterday because the temperature is too high these days that I don't want to run local inference. Just found that Stepfun coding plan can use DeepSeek V4 Pro and support multi-model in a lower price.
drooolingidiot@reddit
Use something like OpenRouter, then you can easily switch between different models and inference providers
Narsha05@reddit
Hi, where did you see that Stepfun coding plan has deepseek? in their doc i can only find their own models
NickCanCode@reddit
Oh, I am from HK so I was not reading the English page. It looks like the English page excluded the
step-router-v1model from the listing. That model will intelligently route to DeepSeek v4 pro when needed. Not sure if it is a mistake or by design to be excluded from the English users. You may want to ask CS first if you are consider to buy the plan. There may be just update delay on the English page.https://platform.stepfun.com/docs/zh/step-plan/integrations/reasoning-api
(This one has
step-router-v1from the model list which mention routing)https://platform.stepfun.ai/docs/en/guides/developer/reasoning
(English page didn't have that model)
Narsha05@reddit
Damn, thanks for the reply. I thought I was too retarded to find the information.
SnooPaintings8639@reddit
Did they publish recommended sampling params? I cant find any.
rpkarma@reddit
None that I’ve seen. Im using theirs from 3.5 - temp 1 top p 0.95
Zeeplankton@reddit
Can fit into 96gb pro?
ilintar@reddit
Yaaay, my favorite model got a sequel! *And* they added the old VL tower from Step3-VL, so it's now text + image!
tarruda@reddit
This is looks really promising
tarruda@reddit
"The prince that was promised" of local LLMs.
charmander_cha@reddit
Serve bem no opencode?
ZealousidealBunch220@reddit
This is insanely good news!
myreala@reddit
Step 3.5 Flash, was already pretty good, So this will be even better. This is a really great model for people who are running Nvidia Spark or something similar. Some people might even get at least decent results with one GPU and a lot of fast system RAM. Something like R9700 + strix halo. And you have SOTA comparable model running locally, Albeit fairly slowly.
Front-Relief473@reddit
Yes, I'm waiting for the quantitative version of iq4xs.
craftogrammer@reddit
So looks like I can run this with my 16GB VRAM and 96GB DDR5 RAM, IQ4_XS quant?
reto-wyss@reddit
Quick test using vllm-nightly and NVFP4 checkpoint on 2x Pro 6k with 64 concurrent requests at relatively shallow context 2200 tg/s.
DriveSolid7073@reddit
By "caption benchmark," do you mean the VIT (visual component) test in image captions? If so, what are the results? I suppose it all depends on the VIT in this case and the correct instructions. The Gemma 4 was specifically trained for this, but maybe there's something interesting here.
quantier@reddit
you should use ipc=host if you are running the docker container to minimize memory leakage. Also could be worth optimizing NCCL. But loving the fact that you srw able to do 64 concurrent requests at full context window! Will test soon 😍
mr_zerolith@reddit
What kind of token gen/sec are you getting on this versus 3.5?
On the page they're claiming it's even faster, wonder if that's true
LegacyRemaster@reddit
196b ... heroes
rpkarma@reddit
Oh yeah here we go, because I have MTP hacked in to llama.cpp for 3.5 flash :D stoked to see what this is like
pmttyji@reddit
https://github.com/ggml-org/llama.cpp/pull/23274
rpkarma@reddit
Neat, should compare mine to his. Only thing I was sad about is it was only a 30% speed up in practice on my Spark
mindwip@reddit
Guess I buying a second strix halo or an external GPU for my current strix halo lol
Bemchmarks look nice
Zc5Gwu@reddit
It doesn’t fit on one?
mindwip@reddit
It does, just want a higher q.
kant12@reddit
I gotta say having two is pretty nice!
No_Mango7658@reddit
Egpu for my strix halo has been tempting
jacek2023@reddit
Sounds great. Previous Step Flash was quite usable on my setup. This one is smaller?
silentsnake@reddit
Does the gguf version comes with mtp?
No_Mango7658@reddit
StepFun 3.5 Flash was amazing. So excited for 3.7!!
Thank you
mr_zerolith@reddit
Aw yeah!
At my shop we love this model for coding with a 128gb vram setup, can't wait to try this but i'm betting it will take some days for the model support to be there!
MotokoAGI@reddit
The old one thought so much, it was just better for me to run models twice the size. Hopefully it doesn't over think.
1ncehost@reddit
Impressive benchmarks. We'll see how it holds up to real use.
hp1337@reddit
This is the best size model for my 6x3090 rig. Looking forward to testing this!