Geekbench 6.7 - Geekbench Blog
Posted by -protonsandneutrons-@reddit | hardware | View on Reddit | 44 comments
Geekbench 6.7 scores remain fully comparable with Geekbench 6.3, 6.4, 6.5, and 6.6 scores. Geekbench 6.7 is a recommended update for all Geekbench 6 users.
Paed0philic_Jyu@reddit
I think this affair reduces the credibility of Geekbench as a benchmark. It is certainly not of the same quality as SPEC, because a simple paraphrase of their response is that the choice of whether a workload will or will not make use of SIMD is completely arbitrary.
CalmSpinach2140@reddit
You can also cheat on SPEC; Intel has done so before.
Paed0philic_Jyu@reddit
With SPEC you can at least compile it yourself and report to the company if any combination of compiler and compiler flags gives unexpected results, which would then be corrected in an update.
This has happened before with SPEC2017.
The company behind Geekbench arbitrarily chooses what workload will make use of different CPU features, even though in this case it is quite trivial to add SIMD to the workloads where Intel's tool replaces the instructions being executed.
VenditatioDelendaEst@reddit
It depends on whether Intel's tool improves the performance of arbitrary code locally built from source. If it doesn't, allowing it in benchmarks makes the benchmarks unrepresentative of what you actually get as a prospective customer.
-march=native SIMD is legit. "Detect the sha512 hash of the binary and patch in hand-rolled assembly" is not legit.
Paed0philic_Jyu@reddit
You need to demonstrate that Intel is patching in hand-rolled assembly.
Besides, the example provided by Geekbench talks about a 30% uplift. Even -O3 and -march=native alone can give 2x-3x gains depending on the code.
The uplift could simply come from optimized functions written for Intel being invoked in place of those in the standard math library.
In fact, given the workload Geekbench used as their example - HDR - it is highly likely that this is indeed what is happening.
CalmSpinach2140@reddit
Okay but Geekbench subtests correlate pretty well with SPEC…
Apple leads in SPEC2017 as per multiple SPEC test sources, which is also what Geekbench, Cinebench, and every other real single-threaded application report.
Paed0philic_Jyu@reddit
The issue is not correlation but controllability.
If I want SPEC to use only SSE instructions and no AVX, I can set it to do so.
Geekbench however, says that SIMD is only allowed in a certain number of workloads for reasons of validity in making comparisons.
Which is fine, except that the choice of which workloads allow SIMD and which generate only scalar instructions comes down to the personal taste of John Poole, the creator of Geekbench.
CalmSpinach2140@reddit
But that's the case for any closed-source application? Don't know why you are singling out Geekbench.
So you are saying we should not compare x86 and Arm CPUs with closed-source apps or benchmarks because Intel loses in them?
But here is the thing: Intel also loses against open-source apps like Blender or HandBrake, where you can see the source code.
Paed0philic_Jyu@reddit
That isn't what I said at all.
CalmSpinach2140@reddit
You said Geekbench loses credibility for comparing CPUs with different ISAs because it's up to John Poole to decide how much of the code is scalar and how much is vectorised?
I'm asking: does that also apply to real-world closed applications like Photoshop etc.? Because we can't see how much of the code is vectorised, just like with Geekbench.
Paed0philic_Jyu@reddit
An app like Photoshop is not exclusively a benchmarking app.
Unlike Geekbench, Adobe has a reason to change the underlying code to make Photoshop more performant.
Whether by rewriting the scalar parts to make use of more vector operations, or through other optimizations. Those changes automatically void any comparison of Photoshop with Geekbench.
CalmSpinach2140@reddit
But that's the problem, right? Unless BOT is universal, it doesn't work in any application other than a limited number of games.
So why did Intel choose Geekbench for BOT? Why not pick Cinebench, which also uses scalar code?
When Adobe makes those changes it’s applicable to all x86 users and is not only for a certain CPU generation.
Paed0philic_Jyu@reddit
BOT is most likely a preview of what is to come when it is iterated upon further. Even when it becomes more widespread, it will still apply to Intel CPUs only.
So it would be up to software vendors to make their apps use more SIMD. When that happens, the gains from BOT will automatically become negligible.
I see BOT not as an Intel trick to win benchmarks but as a call to software developers to up their act re. software optimization.
Polar_Banny@reddit
Still, it doesn't change the fact that this is a marketing benchmark. When it comes to CPU performance, I believe there is no equivalent to SPEC2017.
-protonsandneutrons-@reddit (OP)
What precisely makes something a "marketing" benchmark? That all major CPU manufacturers (Intel, AMD, Arm, Qualcomm, etc., excluding Apple) run it in their marketing? The same applies to SPEC, Cinebench, CrossMark, UL Procyon, etc.
Intel (or any CPU firm), when given the chance, will also find faults with SPEC. That's normal: every benchmark has its limits. No benchmark will ever model your exact workload; each models a representative average / geomean of workloads for a target audience. I won't repeat Veedrac verbatim (though most in this sub would benefit from reading it a few times), but he wrote a great essay on synthetics: The fallacy of ‘synthetic benchmarks’ : r/hardware
In the end, Geekbench and SPEC are very correlated. I'd love to see what CPU benchmark is more closely correlated than this:
Geekerwan SPEC2017 geomean (source A, source B) versus top GB6 results from NBC. This is a Pearson correlation. Will there be some outliers? Obviously, these are not carbon copies of the same test. But Geekbench 6.3+ is, from actual data, the closest almost anyone on r/hardware will get to running a SPEC test.
Polar_Banny@reddit
You forgot about MT scores and some links.
CalmSpinach2140@reddit
No one who understands Geekbench 6 uses it for MT. It's for ST.
virtualmnemonic@reddit
The GB6 multithreaded benchmark is a response to vendors like Intel adding a shit ton of cores to drive up theoretical performance without it translating into real-world usage.
Seriously, what are consumers realistically doing on their computers that can be parallelized across 16+ threads? Single thread performance still remains king.
GHz-Man@reddit
But are consumers really buying those chips? Outside of gaming PCs, most people aren't, they're probably buying a Lenovo laptop or MacBook or tablet.
Most software is at least somewhat parallelized now, even web browsers.
Apple's cheapest laptop at $499 has 6 cores, so clearly it would be a good idea for software to take advantage of multiple cores.
virtualmnemonic@reddit
GB6 MT does an excellent job of measuring performance on tasks like browsing: tasks that do utilize multiple threads but do not scale linearly with core count the way video encoding does.
6 cores is a good spot for consumer applications. Intel was pumping over 16 threads into their i5's to make them competitive. They've since slowed down, reversing the trend by removing SMT.
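The scaling difference being described is just Amdahl's law: if a fraction p of a task can run in parallel, n cores cap the speedup at 1 / ((1 - p) + p / n). A tiny sketch (the p values in the note below are illustrative, not measured):

```c
#include <assert.h>

/* Amdahl's law: best-case speedup on n cores when a fraction
 * p (between 0 and 1) of the work is parallelizable. */
static double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

With p = 0.5 (browsing-like), 16 cores give only about 1.88x; with p = 0.95 (encoding-like), the same 16 cores give about 9.1x, which is why a browsing-style MT benchmark and an embarrassingly parallel one can rank chips differently.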
GHz-Man@reddit
Video encoding hasn't been done on the CPU for like a decade now.
Apple, Intel, Nvidia, and AMD GPUs all have hardware encoding support, which is many times faster than software encoding on the CPU.
I can't speak to Windows software, but everything I've seen on my Mac seems to scale well across the 10 cores in my MacBook Air.
CalmSpinach2140@reddit
You can do cpu encoding in handbrake, for bluray rips etc
GHz-Man@reddit
You can, but why would you want to? It's several times slower than hardware encoding on the GPU.
ParthProLegend@reddit
Exactly
DerpSenpai@reddit
Geekbench MT is good, wtf are you on about. It's not perfectly multithreaded like Geekbench 5, and that is why it's much better than Cinebench for MT. Cinebench is still a good thing to track ofc, but Geekbench MT checks how real workloads use MT, not fully parallel ones.
ParthProLegend@reddit
Stupid derp is back
-protonsandneutrons-@reddit (OP)
I picked 1T as Geekbench is predominantly referenced as 1T.
But it's a fair point. MT testing asks us a separate question: do we want to model embarrassingly parallel / separate task multitasking, where each core runs a small and separate workload? That aligns more with enterprise, scientific, or professional workloads. Or do we want to model shared task multitasking, where all cores work on a large, single task? That is more aligned to consumer workloads.
Geekbench 6 is designed to model consumer usage, so it chose the latter. This not only stresses the microarchitecture(s), but also the caches, uncore, fabric, interconnects, even RAM, etc. Other benchmarks—like Cinebench, SPECrate—don't make the same decision because they are modeling non-consumer workloads. In that sense, Geekbench 6 MT is not correlated to SPECrate run on separate cores.
But none of this makes any of these benchmarks "marketing benchmarks", which is an amorphous and loaded term. Under MT, Geekbench 6 answers a different question than SPECrate.
Verite_Rendition@reddit
To be fair, all of those companies would quote the National Enquirer (or AnTuTu) if they thought it would help convince people and move sales. But I agree that Geekbench is not a trivial and/or useless benchmark.
beneficiarioinss@reddit
If what Intel did was targeted compiled-code optimization, what stops Apple or any Android brand from doing the same? Last I checked, Xiaomi is literally quoting AnTuTu benchmark numbers on their marketing pages. Wouldn't those brands do the same and simply not announce it like Intel did? They already give benchmarks targeted optimizations by allowing only them to run at peak frequencies.
laffer1@reddit
I hope with the next major release they increase the time the benchmark runs. The results are often useless because it’s so quick that boost clocks can cause wins for chips that aren’t very good with sustained load.
hibbel@reddit
I simply don't think Geekbench is the benchmark you're looking for. It's supposed to mimic real-world use cases, and most of what we do on our PCs is very bursty. A 10-minute loop would reflect most people's workloads very poorly.
There are benchmarks for that, but Geekbench isn't one of them and doesn't try to be.
laffer1@reddit
I don't exclusively use Geekbench results, but since a lot of 'leaks' and news on new CPUs tend to use Geekbench, it gives a lot of false hope. Intel results are almost always wrong, and Qualcomm is often more disappointing as well. AMD and Apple results seem closer to reality.
When buying CPUs, I use the following:
Compiler benchmarks from Phoronix and to a lesser degree gamersnexus.
LZMA/7zip results
Secondary: gaming benchmarks from GN/Jayztwocents/HUB (gaming system only)
Cinebench: shows sustained load results even though it's not what I actually do. R23 load is very similar to compiler load
GN's Chromium compile benchmarks are also skewed since they run on Windows. If you compile on other operating systems, particularly ones without Thread Director support, they are way off; Intel is much worse in the real world. Raptor Lake was terrible. Arrow Lake does a lot better since the E-cores improved a lot.
My primary use case is building an OS and packages for that OS (which use lzma compression/decompression a lot too).
Sopel97@reddit
or use stockfish
GHz-Man@reddit
Isn't that just usually the fault of not enough cooling?
The chip won't throttle if it has enough cooling.
laffer1@reddit
No. Think about the Raptor Lake parts. They have great boost clocks but suck afterwards.
Regardless of cooling, they will eventually saturate the cooler. Knowing how fast the chips are for sustained load has value for folks like me that compile software for long periods.
One of my desktops has a custom loop with 3 radiators: 420mm, 120mm and 280mm. It takes ten minutes to saturate and sixteen minutes to build one piece of software on the 14700k. It only took 10 minutes on a 3950x with the 420mm radiator. It takes ten minutes on a 265k with the three rads. It’s six minutes on a ryzen 7900 (no x) with a thermalright cooler. Same workload, very different results
GHz-Man@reddit
Ah, I know Apple's chips don't do that. On the Macs with fans in them, they don't thermal throttle with sustained workloads, but the old Intel Macs sometimes did if the cooling wasn't enough.
laffer1@reddit
The MBP I use for work is pretty consistent on compiler workloads.
VastTension6022@reddit
They definitely could add a ~10 min loop option, but it shouldn't replace the current setup. Burst/thermal-solution performance is still important.
laffer1@reddit
It doesn't have to be ten minutes but it should be several on a current high end CPU. Most users don't have the cooling I have.
toniyevych@reddit
If they really want to make scores remotely comparable between systems, they need to disable SME, AVX-512 and other similar stuff.
Geekbench clearly admits that in the 6.3 release notes: https://www.geekbench.com/blog/2024/04/geekbench-63/
Quote: "For systems without SME instructions, Geekbench 6.3 CPU Benchmark scores are comparable with Geekbench 6.1 and Geekbench 6.2 scores. Systems with SME instructions enabled will score higher in Geekbench 6.3 than in earlier Geekbench versions. Geekbench 6.3 GPU Benchmark scores are compatible with Geekbench 6.2 for all systems."
VastTension6022@reddit
If they want scores to be truly comparable they should also disable those pesky floating point coprocessors!
ResponsibleJudge3172@reddit
It's a question of consistency
trololololo2137@reddit
outdated processors getting bad scores is perfectly fine
okoroezenwa@reddit
You continue to be stupid about this. GB "admits" this discrepancy because non-SME systems had things like SVE, AVX512-VNNI, etc. added in 6.1. That's why, for those systems, 6.1 and above are comparable with each other but not with 6.0, whereas for SME systems 6.3/6.4 are comparable with the versions above but not the lower ones.