StepFun 3.7 Flash - Speed Benchmark on M5 Max

Posted by Beamsters@reddit | LocalLLaMA | 11 comments

Just ran a benchmark with the llama.cpp branch that shipped on day 0.

M5 Max, 128 GB, Q4_K_S. Memory peaked at around 120+ GB, which made things sluggish but still usable once cmd+tab landed.

Short context (< 16k) feels fast and very responsive. Speed at 32k-64k is not bad, still usable.

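| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 128 | 1 | 128 | 0.000 | nan | 2.038 | 62.80 | 2.038 | 62.80 |
| 2048 | 128 | 1 | 2176 | 1.938 | 1056.65 | 2.115 | 60.52 | 4.053 | 536.88 |
| 8192 | 128 | 1 | 8320 | 9.153 | 895.01 | 2.233 | 57.32 | 11.386 | 730.71 |
| 16384 | 128 | 1 | 16512 | 22.428 | 730.52 | 2.475 | 51.71 | 24.903 | 663.05 |
| 32768 | 128 | 1 | 32896 | 64.539 | 507.73 | 2.818 | 45.43 | 67.356 | 488.39 |
| 65536 | 128 | 1 | 65664 | 178.227 | 367.71 | 3.774 | 33.92 | 182.001 | 360.79 |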
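For anyone unsure how to read the columns, here is a minimal sketch that recomputes the derived ones from the raw token counts and timings. It assumes the usual reading of these headers (PP = prompt tokens, TG = generated tokens, T_PP and T_TG = seconds spent on each phase, S_* = tokens per second, and with B = 1, N_KV = PP + TG). That reading is an assumption based on the column names, not something stated in the post, and the `derive` helper is hypothetical. Small differences from the table in the last digit come from the timings being rounded to milliseconds.

```python
import math

# (PP, TG, T_PP s, T_TG s) copied from the table above.
ROWS = [
    (0, 128, 0.000, 2.038),
    (2048, 128, 1.938, 2.115),
    (8192, 128, 9.153, 2.233),
    (16384, 128, 22.428, 2.475),
    (32768, 128, 64.539, 2.818),
    (65536, 128, 178.227, 3.774),
]


def derive(pp, tg, t_pp, t_tg):
    """Recompute S_PP, S_TG, T and S from token counts and timings."""
    # With no prompt there is no prompt-processing speed, hence nan.
    s_pp = pp / t_pp if t_pp > 0 else math.nan
    s_tg = tg / t_tg
    total = t_pp + t_tg
    # Assumption: with B = 1, N_KV = PP + TG, and S = N_KV / T.
    s = (pp + tg) / total
    return s_pp, s_tg, total, s


if __name__ == "__main__":
    for row in ROWS:
        s_pp, s_tg, total, s = derive(*row)
        print(
            f"PP={row[0]:>6}  S_PP={s_pp:8.2f}  S_TG={s_tg:6.2f}  "
            f"T={total:8.3f}  S={s:7.2f}"
        )
```

Note that the combined S t/s column is dominated by prompt processing at long context, so S_TG is the number to watch for how responsive generation feels.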

Now the Pelican bench - a very nice one, but with quite a long hand lol

ortegaalfredo@reddit

StepFun also published their own speed benchmarks on Apple, DGX and AMD 395+ in their blog post.

LegacyRemaster@reddit

Downloading. Will test on RTX 6000 96GB + W7800 48GB with Q4_K_S.

MikeLPU@reddit

How are you going to run them together? RPC?

i'm lazy now

Maximum_Parking_5174@reddit

Nvidia Blackwell using Vulkan? Does that work?

tarruda@reddit

I think IQ4_XS would be a better choice for 128 GB. It should have similar performance to Q4_K_S while saving around 6 GB of RAM.

rpkarma@reddit

Yep which means you can enable vision!

OK, it's fast. We will see about long context.

flash numbers always look hot on short prompts but m5 max falls off a cliff once kv cache pressure kicks in past 8k. prompt processing speed is what really hurts these unified memory builds on real work, not the generation tok/s everyone screenshots. one honest 32k haystack run is worth ten more hello-world charts
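The numbers in the post's table can be used to put rough figures on this. A quick sketch, with values copied from that table; using the 2k row as the baseline is an arbitrary choice made here, and `summarize` is a hypothetical helper, not part of any tool:

```python
# (PP, S_PP t/s, S_TG t/s, T_PP s, T s) copied from the benchmark table in the post.
ROWS = [
    (2048, 1056.65, 60.52, 1.938, 4.053),
    (8192, 895.01, 57.32, 9.153, 11.386),
    (16384, 730.52, 51.71, 22.428, 24.903),
    (32768, 507.73, 45.43, 64.539, 67.356),
    (65536, 367.71, 33.92, 178.227, 182.001),
]


def summarize(rows):
    """Relative slowdown vs. the first row, plus prompt share of wall time."""
    base_pp, base_tg = rows[0][1], rows[0][2]
    out = []
    for pp, s_pp, s_tg, t_pp, total in rows:
        out.append({
            "pp": pp,
            "pp_drop": 1 - s_pp / base_pp,  # fraction of prompt speed lost
            "tg_drop": 1 - s_tg / base_tg,  # fraction of generation speed lost
            "pp_share": t_pp / total,       # share of time spent on the prompt
        })
    return out


if __name__ == "__main__":
    for r in summarize(ROWS):
        print(
            f"{r['pp']:>6}: PP speed -{r['pp_drop']:.0%}, "
            f"TG speed -{r['tg_drop']:.0%}, "
            f"prompt = {r['pp_share']:.0%} of total time"
        )
```

On this one run, prompt speed is about 15% below the 2k figure at 8k and about 52% below at 32k, while generation speed drops about 25% at 32k. Prompt processing accounts for roughly 96% of the total time at 32k, so the wait before the first token is the main cost there, though the decline in these rows looks steady rather than sudden.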

sagiroth@reddit

The only reliable benchmark