[ServeTheHome] SPEC Consortium Releases SPEC CPU 2026 Benchmark Suite: The Next Decade of CPU Benchmarking
Posted by Noble00_@reddit | hardware | View on Reddit | 29 comments
STH provides 3rd-party testing of their own. But there are also published results if you want to browse them (👀 included are some M5 Pro MacBooks):
https://www.spec.org/cpu2026/results/cpu2026/
Here are the docs:
Whirblewind@reddit
Really bizarre that they dropped Blender and x264 with nothing similar to replace them.
I suspect CPU 2017 will endure.
mduell@reddit
Replacing video encoding (x264) with audio encoding (FLAC) is odd... I'd like to see a contemporary video encoder.
RyanSmithAT@reddit
There is a fantastic technical paper that they've released which talks quite a bit about the selection process. It doesn't explicitly discuss why x264 was removed, but it does discuss why they had to reject so many media encoder workloads.
In short: the real-world implementations of these codecs rely too much on hand-tuned architecture-specific code.
https://arxiv.org/abs/2605.01575
mduell@reddit
But that was also true for x264 when they picked it.
Sopel97@reddit
so they are basically excluding all software that's optimized, i.e. meant to actually run fast, rendering the benchmark useless
Pristine-Woodpecker@reddit
That's not what it says. They exclude software that is only fast when hand-tuned to a specific CPU. It's supposed to be a vendor-agnostic hardware benchmark, not a "whose implementation has the best hand tuned assembly".
Sopel97@reddit
they've been including Lc0 so that's not it
Pristine-Woodpecker@reddit
That's because x264/x265/AV1 encoding is in practice done exclusively via hand-written, CPU-specific acceleration routines, whereas SPEC requires a benchmark to be CPU agnostic.
An AV1 encoder that doesn't use any assembly probably has a workload profile that doesn't correspond to the real thing at all.
mduell@reddit
But that was also true for x264 when they picked it.
Minced-Juice@reddit
Good. Nobody does 3D rendering or video encoding by assigning one copy of the program per hardware CPU thread, the way the rate benchmarks do.
The new rate benchmarks have more representative workloads, like the Python interpreter and zstd, which actually are run that way in the real world.
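The one-copy-per-thread throughput model described above can be sketched in a few lines of Python. This is a toy illustration only: zlib compression stands in for a real benchmark workload, and the function names are made up.

```python
# Toy model of a SPECrate-style run: launch one independent copy of the
# same workload per hardware thread and measure aggregate throughput.
# zlib compression is a stand-in for the real benchmark; names are illustrative.
import os
import time
import zlib
from concurrent.futures import ProcessPoolExecutor

def workload(_: int) -> int:
    """One full, independent copy of the 'benchmark' (here: compress ~1 MiB)."""
    data = bytes(range(256)) * 4096          # ~1 MiB of repetitive input
    return len(zlib.compress(data, 6))

def rate_run(copies: int) -> float:
    """Run `copies` independent instances in parallel; return copies/second."""
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=copies) as pool:
        results = list(pool.map(workload, range(copies)))
    assert all(r > 0 for r in results)       # every copy finished
    return copies / (time.perf_counter() - start)

if __name__ == "__main__":
    n = os.cpu_count() or 1
    print(f"{n} copies: {rate_run(n):.1f} copies/sec")
```

Each copy is fully independent, which is exactly why rate-style scaling says little about workloads, like video encoding, that have to share state across threads.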
Sopel97@reddit
It's very common to use only a few threads for video encoding jobs and instead use chunked encoding to create more sources that can be encoded in parallel. Video encoding does not parallelize well. Even with SVT-AV1 you're going to have scaling problems past 4 threads on slower presets.
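The chunked-encoding pattern described above can be sketched like this. Again a toy: zlib stands in for a slow video encoder, and real chunked encoding would split at keyframe boundaries (e.g. via a tool like ffmpeg) rather than at fixed byte offsets.

```python
# Chunked "encoding": split one input stream into chunks and encode each
# chunk as its own parallel job, instead of threading one encoder harder.
# zlib at level 9 is a stand-in for a slow encoder preset; chunk boundaries
# here are arbitrary byte offsets, not keyframes as in real chunked encoding.
import zlib
from concurrent.futures import ProcessPoolExecutor

CHUNK = 256 * 1024  # 256 KiB per chunk (illustrative)

def encode_chunk(chunk: bytes) -> bytes:
    return zlib.compress(chunk, 9)

def encode_chunked(data: bytes, workers: int = 4) -> list:
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_chunk, chunks))

if __name__ == "__main__":
    src = bytes(range(256)) * 8192           # ~2 MiB toy "video"
    out = encode_chunked(src)
    # Chunks decode independently, so decode can be parallelized the same way.
    assert b"".join(zlib.decompress(c) for c in out) == src
    print(f"{len(out)} chunks encoded")
```

The parallelism here comes entirely from having many independent sources, not from the encoder itself scaling, which is the point being made about slow presets.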
mduell@reddit
FLAC is a bit of an odd choice, since audio encoding is rarely hugely limiting compared to modern H.265/AV1 video encoding.
DerpSenpai@reddit
Modern video codecs rely on hand-written, architecture-specific assembly. Removing that assembly isn't representative, and keeping it means CPUs would need to be optimised for that specific benchmark. So they removed it.
mduell@reddit
But that was also true for x264 when they picked it.
Interesting-Union-43@reddit
I wish they provided some cheap licenses for enthusiasts. Even 750 USD for a non-profit license is a lot.
Geddagod@reddit
I've heard that branch counts are apparently down across the board vs SPEC 2017, which is pretty surprising. The front end is also being pushed much harder now, which is interesting.
Can't wait for the "memory center characterization of Spec2026" and other PMC based analysis papers of this benchmark.
It's also going to be interesting to see how popular specint2026 is going to be with the major vendors in their benchmark suites for slides. Today companies like AMD and Intel regularly reference specint2017 in their slides, but many of the RISC-V companies still use specint2006 quite extensively in their CPU benchmarks/perf projections.
R-ten-K@reddit
SPEC is rarely used for customer slides. But they are the benches we use most heavily during design/simulation and for conference/journal publications.
Geddagod@reddit
Maybe not directly consumer-facing slides, but Intel regularly cites specint2017 when they claim core IPC gains at events like Intel Tech Tours, and they also use specint for server performance claims in their launch-event slides. So it's sort of a middle ground between customer-facing slides for the general public and NDA'd internal documents lol.
And ofc, as you said, conferences/journal publications too.
R-ten-K@reddit
that makes more sense.
ThankGodImBipolar@reddit
Do you know what the reason for that is?
R-ten-K@reddit
They’re much easier to simulate because they involve smaller datasets and benchmarks. Something like 2006 is also very well understood at this point, so simulators can be heavily tuned for it.
Most RISC-V startups don’t have silicon, so they rely primarily on simulation data. And since their compute and simulation resources are often limited, they naturally gravitate toward workloads that are cheaper and faster to model.
Geddagod@reddit
Doubting the veracity of this claim based on this comment:
Whose claims do seem to be supported by these papers.
R-ten-K@reddit
SPEC06 has smaller datasets and memory footprint. It really is not a mystery that it is easier to simulate due to reduced execution times. Everybody does some form or another of statistical sampling when running simulations.
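The statistical-sampling idea mentioned above can be illustrated with a toy example. The "trace" here is synthetic random data, not real simulator output; real flows use techniques like SimPoint or SMARTS.

```python
# Toy statistical sampling: instead of "simulating" every interval of a
# long trace, measure a random subset and extrapolate. The per-interval
# CPI values below are synthetic; real simulators sample actual traces.
import random
import statistics

random.seed(42)
# Synthetic per-interval CPI values for a 1M-interval "trace".
trace = [1.0 + 0.5 * random.random() for _ in range(1_000_000)]

full_cpi = statistics.fmean(trace)        # the expensive "full simulation"
sample = random.sample(trace, k=1_000)    # simulate only 0.1% of the intervals
sampled_cpi = statistics.fmean(sample)

error = abs(sampled_cpi - full_cpi) / full_cpi
print(f"full={full_cpi:.4f} sampled={sampled_cpi:.4f} error={error:.2%}")
assert error < 0.05  # within a few percent at ~0.1% of the cost
```

The estimate converges quickly because the sample mean's error shrinks with the square root of the sample size, which is why smaller, well-understood suites like SPEC06 are so much cheaper to simulate.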
Geddagod@reddit
Nope. It's kinda funny I remember this exact point also being discussed online a couple weeks ago, and I was also interested in why this seems to be the case. These two comments on a previous post a while ago about this topic were pretty interesting to me.
This point in particular stands out to me:
Vince789@reddit
Great to see more C & C++ vs Fortran
There'll probably still be some variation based on compiler/compiler flags, but hopefully there's less variation compared to 2017.
RyanSmithAT@reddit
Oh there definitely is. Patrick and I were looking at some of the vendor-supplied runs today; Arm turned in a DGX Spark submission that was some 38% higher than ours. Whereas we just used -O3 (this being the default for CPU 2026 in SPEC's config files), Arm also went with -ffast-math -flto=thin -fomit-frame-pointer -fuse-ld=lld. Though they also cranked up the power limit of the system to performance mode and set ulimit to unlimited, so we haven't had a chance to fully isolate the effects of each of those changes.
Vince789@reddit
Oh, that's a huge difference
Although to clarify, I understand hardware vendors are always gonna use highly tuned flags for their perf claims
Sorry, I meant hopefully that there's less variation from the typical default flags that are used by most third-party reviewers so that estimates are comparable between different reviewers
AFAIK even just using the default flags in CPU 2017, the choice of GCC vs Clang can make a difference in scores, given Clang's relatively weaker Fortran support.
That's why it's usually not recommended to compare SPEC estimates from different reviewers unless we know they used the same compiler + flags.
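To make the flag discussion concrete, here is roughly what the two configurations would look like as SPEC-style config fragments. The syntax is modeled on CPU 2017 config files; the exact CPU 2026 syntax and section names are assumptions, and the tuned flag set is the one quoted earlier in the thread for the Arm submission.

```
# Reviewer-style run, default optimization only (hypothetical fragment):
default:
   OPTIMIZE = -O3

# Vendor-style tuned run (flags from the Arm submission discussed above):
default:
   OPTIMIZE = -O3 -ffast-math -flto=thin -fomit-frame-pointer -fuse-ld=lld
```

Note that -ffast-math relaxes IEEE floating-point semantics and -flto=thin enables cross-module optimization, so scores built this way aren't comparable to plain -O3 runs.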
IBM296@reddit
Not bad. Will help to see how good CPUs actually are nowadays.
Noble00_@reddit (OP)
Excerpt from article: