[ServeTheHome] SPEC Consortium Releases SPEC CPU 2026 Benchmark Suite: The Next Decade of CPU Benchmarking

[-]

Whirblewind@reddit

Really bizarre that they dropped Blender and x264 with nothing similar to replace them.

I suspect CPU 2017 will endure.

[-]

mduell@reddit

Replacing video encoding (x264) with audio encoding (FLAC) is odd... I'd like to see a contemporary video encoder.

[-]

There is a fantastic technical paper that they've released which talks quite a bit about the selection process. It doesn't explicitly discuss why x264 was removed, but it does discuss why they had to reject so many media encoder workloads.

In short: the real-world implementations of these codecs rely too much on hand-tuned architecture-specific code.

https://arxiv.org/abs/2605.01575

We considered AV1/AOM and Opus, two codecs used widely for internet video and audio. Both make such heavy use of architecture-specific assembly that removing these implementations rendered their performance profiles unrepresentative. And in the case of Opus, the program processed audio so quickly that the workload shifted from CPU-bound to I/O-bound, with multi-copy SPECrate runs stalling on disk activity while the CPU remained idle. 817.flac doesn’t have this issue since the benchmark is a multi-threaded program writing into just a single file.

[-]

mduell@reddit

But that was also true for x264 when they picked it.

[-]

Sopel97@reddit

so they are basically excluding all software that's optimized, i.e. meant to actually run fast, rendering the benchmark useless

[-]

Pristine-Woodpecker@reddit

That's not what it says. They exclude software that is only fast when hand-tuned to a specific CPU. It's supposed to be a vendor-agnostic hardware benchmark, not a "whose implementation has the best hand tuned assembly".

[-]

Sopel97@reddit

they've been including Lc0 so that's not it

[-]

Pristine-Woodpecker@reddit

That's because x264/x265/av1 in practice is done exclusively by using hand-written CPU specific acceleration routines, whereas SPEC requires a benchmark to be CPU agnostic.

An AV1 encoder that doesn't use any assembly probably has a workload profile that doesn't correspond to the real thing at all.

[-]

mduell@reddit

But that was also true for x264 when they picked it.

[-]

Minced-Juice@reddit

Good. Nobody does 3D rendering or video encoding by assigning one copy of the program per hardware CPU thread - like it is done for the rate benchmarks.

The new rate benchmarks have more representative workloads like Python interpreter and zstd which are actually run in that way in the real world.

[-]

Sopel97@reddit

It's very common to use only a few threads for video encoding jobs and instead use chunked encoding to create more sources that can be encoded in parallel. Video encoding does not parallelize well. Even with SVT-AV1 you're going to have scaling problems past 4 threads on slower presets.

[-]

mduell@reddit

FLAC is a bit of an odd choice, since audio encoding is rarely hugely limiting compared to modern H.265/AV1 video encoding.

[-]

DerpSenpai@reddit

Modern video codecs have manual architecture made assembly for it. Removing that assembly isn't representative and using it makes it that cpus need to be optimised for that benchmark. So they removed it

[-]

mduell@reddit

But that was also true for x264 when they picked it.

[-]

Interesting-Union-43@reddit

I wish they provide some cheap licenses for enthusiasts. Even 750 USD for non-profit license is a lot

[-]

Geddagod@reddit

I've heard that branches are now apparently down across the board vs spec2017, which is pretty surprising. Apparently the front end is also being pushed much harder now though, which is interesting.

Can't wait for the "memory center characterization of Spec2026" and other PMC based analysis papers of this benchmark.

It's also going to be interesting to see how popular specint2026 is going to be with the major vendors in their benchmark suites for slides. Today companies like AMD and Intel regularly reference specint2017 in their slides, but many of the risc-v companies still use specint2006 quite extensively in their cpu bechmarks/perf projections.

[-]

R-ten-K@reddit

SPEC is rarely used for customer slides. But they are the benches we use most heavily during design/simulation and for conference/journal publications.

[-]

Geddagod@reddit

Maybe not directly consumer facing slides, but Intel regularly cites specint2017 when they make core IPC gains at stuff like Intel tech tours, and also use specint for stuff like server performance claims in their server launch event slides. So sort of a middle ground for client customer slides for the general public and NDA-ed internal documents lol.

And ofc, as you said, conferences/journal publications too.

[-]

R-ten-K@reddit

that makes more sense.

[-]

ThankGodImBipolar@reddit

but many of the risc-v companies still use specint2006 quite extensively in their cpu bechmarks/perf projections.

Do you know what the reason for that is?

[-]

R-ten-K@reddit

They’re much easier to simulate because they involve smaller datasets and benchmarks. Something like 2006 is also very well understood at this point, so simulators can be heavily tuned for it.

Most RISC-V startups don’t have silicon, so they rely primarily on simulation data. And since their compute and simulation resources are often limited, they naturally gravitate toward workloads that are cheaper and faster to model.

[-]

Geddagod@reddit

Doubting the veracity of this claim based on this comment:

It's not really particularly more runnable or more useful than the 2k17. You don't typically run full benchmarks for simulation purposes, you sample interesting regions and assign them weights based on their importance. There are papers out there that explain how this is done and you can get away with running fractions of the actual workload while still achieving nearly perfect correlation with the full results.

Whose claims do seem to be supported based on these papers.

[-]

R-ten-K@reddit

SPEC06 has smaller datasets and memory footprint. It really is not a mystery that it is easier to simulate due to reduced execution times. Everybody does some form or another of statistical sampling when running simulations.

[-]

Geddagod@reddit

Nope. It's kinda funny I remember this exact point also being discussed online a couple weeks ago, and I was also interested in why this seems to be the case. These two comments on a previous post a while ago about this topic were pretty interesting to me.

This point in particular stands out to me:

It is significantly easier to do better on the 2k6 workloads, IIRC they have much lower instruction and data cache requirements and so you can stay competitive even without caches that are competitive in today's market. In particular, they're much less memory dependent which is where a lot of modern core design focus is.

[-]

Vince789@reddit

Great to see more C & C++ vs Fortran

There'll probably still be some variation based on compiler/compiler flags, but hopefully there's less variation compared to 2016

[-]

RyanSmithAT@reddit

There'll probably still be some variation based on compiler/compiler flags

Oh there definitely is. Patrick and I were looking at some of the vendor-supplied runs today; Arm turned in a DGX Spark submission that was some 38% higher than ours. Whereas we just used -O3 (this being the default for CPU 2026 in SPEC's config files) Arm also went with -ffast-math -flto=thin -fomit-frame-pointer -fuse-ld=lld. Though they also cranked up the power limit of the system to performance mode and set ulimit to unlimited, so we haven't had a chance to fully isolate the effects of each of those changes.

[-]

Vince789@reddit

Oh, that's a huge difference

Although to clarify, I understand hardware vendors are always gonna use highly tuned flags for their perf claims

Sorry, I meant hopefully that there's less variation from the typical default flags that are used by most third-party reviewers so that estimates are comparable between different reviewers

AFAIK even just using the default flags in CPU2017, the difference between GCC vs Clang can make a difference in scores as Clang's relatively weaker Fortran

Thus why its usually not recommend that we compare SPEC esitmates from different reviewers, unless we know they used the same compiler+flags

[-]

IBM296@reddit

Not bad. Will help to see how good CPU's actually are nowadays.

[-]

Noble00_@reddit (OP)

Excerpt from article:

The 2026 edition of the benchmark suite has 52 benchmarks in all, 9 more than the 2016 suite. Of those, 38 are all new benchmarks. Only 14 benchmarks were kept from the 2016 suite, particularly evergreen software such as GCC, LLVM, and various data compression utilities, and even then, all of those benchmarks have been updated to both use their latest code and to use newer and deeper workloads.

With 52 benchmarks in all, there is more to cover here than there is time to cover them. Notably, Perl, x264, and Blender have all been removed from the 2026 suite. In its place are new benchmarks such as CPython, FLAC, and SQLite. There are also plenty of computational science workloads, as well as some new industry workloads, such as FPGA place and routing (VPR).

The total number of lines of code has more than doubled, going from around 7.1 million to about 16.7 million. Most of that code belongs to GCC, LLVM, and FemFlow, a finite-element fluid-dynamics simulation.

As you might expect, the latest edition of the suite updates the benchmark suite to use much newer language standards as well. Whereas SPEC CPU 2017 was based around C99, C++03, and Fortran 2003, SPEC CPU 2026 benchmarks are based around C18, C++17, and Fortran 2018 – all of which are around 15 to 20 years newer in age. So the constituent benchmarks all have access to many newer language features, most notably C++ threading (std::thread) and Fortran concurrency (DO_CONCURRENT). The latter changes primarily impact the SPECspeed benchmarks, as SPECrate explicitly runs multiple copies of a single program rather than using multithreading within a program.