StepFun 3.7 Flash | TheaterFire

[-]

ortegaalfredo@reddit

This is a very strange model. Its thinking process is basically incomprehensible. It writes like a lunatic with autism. But then it stops and produces a perfect answer better than models >1TB in size. Apparently they fixed the 'infinite thinking' bug that 3.5 had, and now its quite usable.

This might be it, if you have 4x3090s or better.

[-]

digitalfreshair@reddit

Better than minimax m2.7 for ds4 flash for coding?

[-]

ortegaalfredo@reddit

Honestly, not. I'm testing it with the nigthly version of vllm so it alway has bugs. It seems very good at coding but it makes many mistakes.

[-]

jdchmiel@reddit

Hmm how did you eval minimax 2.7 vs stepfun 3.5 to determine that?

[-]

ortegaalfredo@reddit

Super easy, just tell them to write a small game. Minimax usually does in one-shot. Stepfun does a great job too, but it makes mistakes. But this is on my local rig, with an experimental vllm nigthly version. Might improve in the future, as many models do.

[-]

llama-impersonator@reddit

so this version no longer yaps for 8k tokens before answer?

[-]

ortegaalfredo@reddit

Tested it on their website and it's very reasonable now.

[-]

nmkd@reddit

Wonder what's the reason for reasoning being reasonable now

[-]

EndlessZone123@reddit

Is there any reason why thinking tokens need to be reasonable?

Could train a modem to think in Shakespeare and output normals results right?

[-]

Zeeplankton@reddit

I think there was a recent study about training models to just not think in language at all but latent space https://arxiv.org/abs/2412.06769

[-]

uhuge@reddit

sucks for debugging the RL training, IMO

[-]

ortegaalfredo@reddit

I would expect that you cannot extract a good output from garbage tokens. Thinking tokens are, after all, just adjusting the output for a more specific prediction.

[-]

-dysangel-@reddit

you're not extracting the tokens, you're caching the thinking state during each token in the KV cache

[-]

ImpressiveSuperfluit@reddit

Technically you can get anything from anything, that's what language is to begin with. There is no universal law saying that I now have to write wobblywubbledidat, I was simply trained for it, if you will. In principle you could make a thinking step thing in various shadings of pink elephant descriptions and it's whatever, so long as those things still carry meaningful vectors. Thinking in, more or less, normal plain text is mostly just convenient, but technically you can do whatever you want.

[-]

evia89@reddit

reasonable

https://old.reddit.com/r/ClaudeAI/comments/1tqd246/opus_48_in_caveman_talking_about_the_difference/

or https://arxiv.org/abs/2502.18600

[-]

EbbNorth7735@reddit

The reasoning is trained through RL. Many simulations to hone the skill. It is what it is and can be anything it's trained to do.

[-]

FuckNinjas@reddit

It was Greek to me.
Brevity is the soul of wit.

That's a good idea. Shall we proceed?

[-]

uhuge@reddit

explain Kio with examples

It looks like you might be referring to Kio, which is a name used for a few different things. The most common references are: A Character: Kio is a main character in the anime/manga series Heroic Age. A Company: Kio is a company that makes wireless audio products (like headphones and speakers). A Programming Library: Kio is a C++ library for asynchronous I/O. Since you asked for examples, I'll explain the most likely one: Kio from the anime Heroic Age.

[-]

DistanceSolar1449@reddit

The user asks: "Explain a sieve of erathosthenes and give me python code for it". Simple request. Provide an explanation of the Sieve of Eratosthenes algorithm and give Python code. Should be straightforward. Use code fences. Probably also include complexity analysis. The user didn't request anything else. So answer in normal language.The user asks: "Explain a sieve of erathosthenes and give me python code for it". Simple request. Provide an explanation of the Sieve of Eratosthenes algorithm and give Python code. Should be straightforward. Use code fences. Probably also include complexity analysis. The user didn't request anything else. So answer in normal language.

The user asks: "What is the smallest palindrome number greater than 132?" This is a straightforward math question. We need to find the smallest palindrome integer > 132.
A palindrome number reads the same forwards and backwards. For numbers > 132:
- 133 is not palindrome (1-3-3 -> 133 reversed is 331, not equal)
- 134, 135, ... up to 141: 141 reversed is 141 => palindrome. So 141 is a palindrome >132.
But we need to double-check if there is any palindrome between 132 and 141. 133, 134, 135, 136, 137, 138, 139, 140 are not palindromes. 141 is palindrome.
Thus answer: 141.
But wait: 133? 1 3 3 reversed is 331, not equal. 134 reversed 431, etc. So indeed 141 is the smallest palindrome >132.
Thus answer: 141.
We might provide a brief explanation.

It's better than gpt-oss reasoning at least lol

[-]

Gullible_Drummer_246@reddit

This is hilarious

[-]

comperr@reddit

What about 2x5090s? And 96gb ram. The other option i have is 5090+3090 and 128gb ram.

[-]

pigeon57434@reddit

sounds like the sign of a smart model to me thats what o3 did basically

[-]

FoxiPanda@reddit

BF16: https://huggingface.co/stepfun-ai/Step-3.7-Flash/
FP8: https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8
NVFP4: https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4
GGUF: https://huggingface.co/stepfun-ai/Step-3.7-Flash-GGUF

From their HF model page in case you need direct links.

[-]

annodomini@reddit

Step 3.5 Flash ranked below Gemma 4 31B on AA. Here's to hoping that 3.7 does a bit better, though the 3 bit quant may not be too friendly to it. Anyhow, downloading now, would like to compare to MiniMax, Qwen 3.5 122b, and the others.

[-]

annodomini@reddit

OK, have downloaded and Step 3.7 at a 3 bit quant isn't bad so far. Only a few quick tests, nothing comprehensive, but definitely something worthy of trying out if you can fit it.

[-]

llama-impersonator@reddit

AA is a pretty crap bench

[-]

annodomini@reddit

Got any better ones that are as comprehensive in models tested?

[-]

j_osb@reddit

the problem with AA is that it mainly 'benchmarks' agentic performance.

The benchmarks are oversaturated, and the actual benchmark selection is not a good representation of even that performance.

It for example rates many very small models as better than Deepseek R1.
Yes, Deepseek R1 did not support tool calls and that is why it's ranked so low, but at the time of that R1 was still the smartest DS model for a lot of things.

The only good way to test models is to try them on your usecase and evaluate them there.

[-]

annodomini@reddit

I mean, agentic performance is pretty important to me. I need reliable tool calling, so having that included in the bench is helpful. A smart model that doesn't have good tool calling wouldn't be that useful for my use cases.

I'm aware that AA benches aren't perfect, and some models will benchmaxx them; but I haven't found anything better for getting a rough sense of how a model performs, and has such a comprehensive and up to date set of data, so I can use it for comparing most recent models.

I use it to get a rough sense of which models will be worth the download and testing time locally. Of course you need to test against your own representative tasks. But it can be helpful to have a starting point, there are a lot of models out there and I don't have the time to test them all.

[-]

llama-impersonator@reddit

lol, absolutely not. AA are the hype winners.

the benchmarks i liked kinda disappeared. fiction.live's context bench was adversarial enough to show actual differences in ability to handle context, but it stopped getting updated and might've saturated after the agentic RLpocalypse was upon us. dubesor archived his bench, ooba went private, etc.

[-]

ortegaalfredo@reddit

Gemma-4-31B was not better than Qwen-3.6-27B in any way.

I'm seeing Step-3.7-Flash is equal or better than 27B at most things I throw at it.

[-]

ai-infos@reddit

don't get why you're being downvoted, as i get the same feeling

[-]

z_3454_pfk@reddit

Step 3.5 flash could produce novel outputs when compared to even opus and gpt 5.4. so yeah it’s really good.

[-]

EbbNorth7735@reddit

Would love the communities input on how to maximize NVFP4 on a Blackwell RTX 6000 with vision and MTP if that's a thing, if not all good. Also would love to see benchmark comparisons between Qwen 3.6 27B and this model. Will hunt those down next. More or less just typing out loud.

[-]

FoxiPanda@reddit

You'll need 2x RTX Pro 6000 Blackwells to load this up - it's 124GB in NVFP4 mode, but the good news is that you'd be able to pretty easily get model + mmproj + full context window at BF16 / Q8 KV cache and still have VRAM left over I suspect.

[-]

quantier@reddit

Yeah this is what I am anticipating as well. I didn’t see any reference to MTP, do we have MTP support?

This model is looking very interesting 😃

I’ll get to testing soon. I hope to get about 24-32 num seqs at 256K Kv Cache

[-]

beneath_steel_sky@reddit

/u/ilintar is working on it and trying to fast-track mtp support https://github.com/ggml-org/llama.cpp/pull/23274#issuecomment-4573905564

[-]

jld1532@reddit

Here's hoping the unsloth quants are a bit smaller.

[-]

MotokoAGI@reddit

The unsloth quants will be bigger with the dynamic quants.

[-]

FoxiPanda@reddit

It's a ~200B model, so there's no way around it being pretty big. Well suited for >128GB Mac Studios / DGX Spark / 2x RTX 6000 Pro users though.

[-]

jld1532@reddit

I hear ya but unsloth's minimax iq4_xs is only 3gb bigger than this gguf and that is a 230B A10B model. I'm not saying gguf size is always linear but I am becoming an unsloth believer.

[-]

FoxiPanda@reddit

Yeah I agree, you'll probably find something that you can run at a slightly smaller size, but it might vary a bit from minimax due to architectural differences. My bet would be on something like IQ3_XS or some such from Unsloth to fit in 96GB of VRAM with some sort of decent context window (probably not full though).

[-]

Dazzling_Equipment_9@reddit

This is fantastic! Version 3.5 was already amazing, and with the addition of multimodal capabilities, it should be perfect for Strixhalo!

[-]

my_name_isnt_clever@reddit

What quant do you run it at on Strix Halo? I run ~120b regularly but ~200b is more of a challenge. I imagine I'll have to run it headless to have enough free memory.

[-]

Nybio@reddit

Managed to run it on 4070 + 96 VRAM, got about 15 tokens/s. So far hard to tell how much it is better over qwen 3.6 35b and gemma 4 26b

[-]

suicidaleggroll@reddit

I wasn’t impressed with 3.5. The code it generated was just average, and it was awful with tool calls, making stupid mistakes like launching a docker container in the foreground and locking itself up, inability to write certain format files, etc.

Because of Step’s overthinking, it took twice as long to get a result that was half as good as MiniMax, assuming it was able to finish at all (see above issue with it locking itself up). Hopefully they’ve fixed some of these issues in 3.7, but I’m not going to hold my breath that this is some “1T killer” like the bots were claiming about 3.5.

[-]

sixx7@reddit

Completely agree. I think there's a reason you never hear anyone talk about the Step series of models after release. Might be the worst series of models of all the labs. Tool calling (thus agentic use) is just absolute garbage.

[-]

my_name_isnt_clever@reddit

The 3.5 release was discussed plently when it was new, it's just a big ass model for local so it's more niche than something like Qwen 27b. I have a Strix Halo 128b and people in the 128GB unified memory club were raving about 3.5.

[-]

mr_zerolith@reddit

I can't get any model in the \~200b range to generate such good code, but i'm using CLine, not agents.
I hear it's weak in agentic but we've had good luck with opencode/claude code regardless.

What quant are you running and on what?
And when did you evaluate it? 3.5 got much faster a month ago on long context.

I have a 5090 + RTX PRO 6000 here with some OC running Q4_K_M

[-]

MDSExpro@reddit

Because you need 8bit quants. Had same issue for ages, code suddenly got better once I have upgraded from Qwen3.5-122B int4 to int8.

[-]

suicidaleggroll@reddit

I last tested it about 3 months ago. I was using their own Q4_K_S quant on 2x RTX Pro 6000s.

[-]

spaceman_@reddit

StepFun also dropped a PR to llama.cpp: github.com/ggml-org/llama.cpp/pull/23845

[-]

my_name_isnt_clever@reddit

Just clicking the GGUF link and seeing it was uploaded by StepFun themselves is a good sign. They seem to give slightly more of a shit about us than the other labs.

[-]

jacek2023@reddit

I am able to run Q3 locally at a good speed, and 3.7 seems censored, while 3.5 looks uncensored

[-]

JaredsBored@reddit

The 3.5 is the biggest model I can run on my hardware, and it's very useful for whenever I need a model with the most world knowledge as possible. Definitely will give this a download and try

[-]

jdchmiel@reddit

Same. The contest for me will be minimax 2.7 vs stepfun 3.7 in 4 bit as my 'big' local model, usually with only the experts in vram.

[-]

Septerium@reddit

REALLY nice

[-]

Steuern_Runter@reddit

Benchmark results are looking good, I hope it still holds up well after quantization.

[-]

Jealous-Astronaut457@reddit

MTP ?

[-]

rpkarma@reddit

Yes

[-]

Adventurous-Okra-407@reddit

3.5 was very underrated so this makes me happy to see. Gonna spend some time testing it out.

[-]

a_beautiful_rhind@reddit

It's like 400 prompt and 35t/s for me with the old one at Q4_K_L. did surprisingly well for the active params.

[-]

mr_zerolith@reddit

on what hardware?

[-]

a_beautiful_rhind@reddit

4x3090, QQ89 proc

[-]

LeatherRub7248@reddit

what hardware was that on?

[-]

Due_Net_3342@reddit

we need mtp support please

[-]

theologi@reddit

!RemindMe 15 days

[-]

RemindMeBot@reddit

I will be messaging you in 15 days on 2026-06-13 10:13:23 UTC to remind you of this link

CLICK THIS LINK to send a PM to also be reminded and to reduce spam.

^(Parent commenter can ) ^(delete this message to hide from others.)

^(Info)	^(Custom)	^(Your Reminders)	^(Feedback)

[-]

nuclearbananana@reddit

No news on the mysterious step 3.6 on nanogpt

[-]

nullmove@reddit

StepFun ran an insider access program for 3.6 and I guess once it concluded they just fixed user feedbacks and called the final model 3.7.

Nanogpt probably got access to it, but because it was unreleased I am not sure they had permissions to proxy/redistribute access to it.

[-]

bambamlol@reddit

lol wtf? I just realized there's literally no mention of this on their site. What model was I using?!

Was this maybe an unofficial beta test? Otherwise, why release 3.7 after 3.5?

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[-]

Main_Problem_2696@reddit

196B total, 11B active MoE. Runs on 128GB RAM. SWE-Bench 56.26% beats DeepSeek V4 Flash. Solid for local agent workflows. Used Runable to build a local LLM comparison dashboard. Dropped in benchmark numbers, got clean charts in an afternoon. Made the eval way easier.

[-]

NickCanCode@reddit

Damn, I just added value to my DeepSeek API account yesterday because the temperature is too high these days that I don't want to run local inference. Just found that Stepfun coding plan can use DeepSeek V4 Pro and support multi-model in a lower price.

[-]

drooolingidiot@reddit

Use something like OpenRouter, then you can easily switch between different models and inference providers

[-]

Narsha05@reddit

Hi, where did you see that Stepfun coding plan has deepseek? in their doc i can only find their own models

[-]

NickCanCode@reddit

Oh, I am from HK so I was not reading the English page. It looks like the English page excluded the step-router-v1model from the listing. That model will intelligently route to DeepSeek v4 pro when needed. Not sure if it is a mistake or by design to be excluded from the English users. You may want to ask CS first if you are consider to buy the plan. There may be just update delay on the English page.

https://platform.stepfun.com/docs/zh/step-plan/integrations/reasoning-api
(This one has step-router-v1 from the model list which mention routing)

https://platform.stepfun.ai/docs/en/guides/developer/reasoning
(English page didn't have that model)

[-]

Narsha05@reddit

Damn, thanks for the reply. I thought I was too retarded to find the information.

[-]

SnooPaintings8639@reddit

Did they publish recommended sampling params? I cant find any.

[-]

rpkarma@reddit

None that I’ve seen. Im using theirs from 3.5 - temp 1 top p 0.95

[-]

Zeeplankton@reddit

Can fit into 96gb pro?

[-]

ilintar@reddit

Yaaay, my favorite model got a sequel! *And* they added the old VL tower from Step3-VL, so it's now text + image!

[-]

tarruda@reddit

For a broader set of challenging vision tasks that demand fine-grained perception over high-resolution images or visual reasoning capabilities—such as V*, HR-Bench, and VisualProbe—we grant the model an enriched action space to interact with images, including cropping, zooming in and out, and drawing pixels or bounding boxes. These tools are implemented as a unified code interface, commonly referred to in the field as the Python tool. With Python, Step 3.7 Flash achieves exceptionally strong performance on these benchmarks.

This is looks really promising

[-]

tarruda@reddit

"The prince that was promised" of local LLMs.

[-]

charmander_cha@reddit

Serve bem no opencode?

[-]

ZealousidealBunch220@reddit

This is insanely good news!

[-]

myreala@reddit

Step 3.5 Flash, was already pretty good, So this will be even better. This is a really great model for people who are running Nvidia Spark or something similar. Some people might even get at least decent results with one GPU and a lot of fast system RAM. Something like R9700 + strix halo. And you have SOTA comparable model running locally, Albeit fairly slowly.

[-]

Front-Relief473@reddit

Yes, I'm waiting for the quantitative version of iq4xs.

[-]

craftogrammer@reddit

So looks like I can run this with my 16GB VRAM and 96GB DDR5 RAM, IQ4_XS quant?

[-]

reto-wyss@reddit

Quick test using vllm-nightly and NVFP4 checkpoint on 2x Pro 6k with 64 concurrent requests at relatively shallow context 2200 tg/s.

[-]

DriveSolid7073@reddit

By "caption benchmark," do you mean the VIT (visual component) test in image captions? If so, what are the results? I suppose it all depends on the VIT in this case and the correct instructions. The Gemma 4 was specifically trained for this, but maybe there's something interesting here.

[-]

quantier@reddit

you should use ipc=host if you are running the docker container to minimize memory leakage. Also could be worth optimizing NCCL. But loving the fact that you srw able to do 64 concurrent requests at full context window! Will test soon 😍

[-]