People say that 'pure' CISC CPU's are dead, and modern x86 processors internally convert x86 calls to 'RISC-like' instructions. How true is this (and many other questions)?

[-]

Tai9ch@reddit

Here's another, more modern, question to ask if you want to understand more about processor design and how we got where we are today: What happened to VLIW?

[-]

The basic premise of VLIW is that you have a bunch of proccessors/cores/whatever terminology you want to use sharing a simple front end, which executes the same instruction on multiple pieces of data at the same time in parallel.
And so you can effectively get more cores/processors per transistor count/die area, because the front end is taking up less of it.

With the front end being much simpler, it doesn't do a lot of optimisations on the fly, but relies a lot on the compiler to optimise things in advance.
But it also inherently relies on the ability to parallelise the code you're running, too. If the code just inherently doesn't parallelise well, you'll have a lot of the cores/processors sitting idle.

As an example, the Radeon HD 5870 used a VLIW5 architecture, which had groups of 5 shader processors. But in practice, it turned out that the majority of games used an average of a little bit more than 3 out of 5 shaders.
So, when AMD made the Radeon HD 6870, they moved to VLIW4, with groups of 4 shader processors, so there'd be less idle shader processors. And therefore it would be about as fast as a 5870 in games with fewer shaders overall (1280 vs 1600).

[-]

Tai9ch@reddit

VLIW is much older than its use in graphics cards. The earlier work was focused on single core processors executing single threaded tasks.

[-]

nittanyofthings@reddit

VLIW is impossible to program for. Keeping those parallel resources from having stalls is just to hard. Much of the scheduling info you need isn't available until runtime, not during compilation.

[-]

R-ten-K@reddit

I think a lot of that perception revolves around the initial FUD lobbed at the Itanium compiler. Which sort of turned to not be quite correct, the compiler didn't have that hard of a time filling the bundles (granted the IA64 bundle wasn't particularly "wide")

[-]

EmergencyCucumber905@reddit

Im curious of modern machine learning methods can produce optimal code for VLIW.

[-]

Tai9ch@reddit

Much of the scheduling info you need isn't available until runtime, not during compilation.

Sounds like yet another reason why everything should use a tracing/profiling JIT compiler.

[-]

R-ten-K@reddit

VLIW went onto the areas where it made sense: GPUs and DSPs (or NPUs are they are known now).

Turns out that VLIW and data-parallel programming models work fairly well.

[-]

itsjust_khris@reddit

What's a good resource to begin understanding the answer to this question? Off the top of my head the last time I heard about VLIW was AMDs 5000 and 6000 series graphics cards.

[-]

DaMan619@reddit

Raymond Chen's blog posts on Itanic helped me learn this was never going to work

[-]

Tai9ch@reddit

Have a discussion with an LLM about it. Try to convince the LLM that Itanium was better than AMD64.

[-]

Nuck_Chorris_Stache@reddit

There are basically no "pure CISC" CPUs today, and there are also barely any "pure RISC" CPUs.
They kind of fall on a spectrum, but are basically somewhere in between.

[-]

Dghelneshi@reddit

Forget about the terms RISC and CISC, they're useless. RISC was a specific design philosophy in the 80s and was primarily an answer to very heavily microcoded designs that don't exist anymore, CISC was just "anything that isn't RISC". RISC-V can be variable instruction length depending on extensions, so it's not "RISC" (and thus "CISC"). x86 uops are not a load-store architecture, so it's not "RISC". Yes, high performance ARM designs also use uops. Things have changed since the 80s, it's no longer really useful to talk in these terms and the whole "RISC vs CISC" holy war that people like to propagate is extremely tiring, especially when the term CISC is pretty much entirely meaningless. Talking about individual properties of ISAs (instruction length, load-store, etc.) and what consequences they bear is useful as a point of discussion however.

[-]

_Oman@reddit

TLDR: There hasn't been a true RISC or CISC processor in some time. Both terms are too dated to be useful, at least as far as general purpose processors go.

[-]

Plank_With_A_Nail_In@reddit

RISC came out of C compilers, it was clear that real code didn't touch many of the complex instructions in the CPU.

[-]

h2o2@reddit

I cannot find a reference any more but I distinctly remember John Mashey regretting on comp.arch that RISC was popularised as "reduced", not "regular" (as in fewer & less complex addressing modes, consistent instruction length etc.), which is probably why people stil misinterpret the intent of the term. But your explanation of e.g. ARM and RISC-C properties "not being pure RISC" is spot on and a good example how and why these discussions are not really useful any more.

[-]

trmetroidmaniac@reddit

The idea of RISC and CISC are obsolete.

RISC came about in a particular context. "CISC" processors were not designed as such, it's more that they were labelled it in response to an emerging new design philosophy.

Modern 64-bit ARM chips aren't really RISC. The instruction set has been designed for high instruction-level parallelism, not "simplicity", and certainly not the 1 IPC target of classic RISC.

In particular, 64-bit ARM is largely an Apple creation. The post-RISC ISA of Aarch64 is part of what enabled Apple ARM chips to achieve such high IPC.

[-]

Boeing367-80@reddit

Ok, so what accounts for the greater power efficiency of ARM chips?

[-]

InfernoBlade@reddit

A lot of ARM systems actually aren't particularly efficient. The datacenter ones from Ampere aren't really impressive compared to zen5/zen5c or even current xeons, for example. It's not the ISA, it's the systems design around it, and that design is full of tradeoffs.

As far as Apple themselves go, I think there's a factor a lot of people end up overlooking:

Apple built a memory topology monster in their M-series silicon. The M5 has relatively gigantic 192 KiB L1d/128 KiB L1i and 16 MiB L2 shared caches. Meaning the CPU is less likely to have a miss in its caches as compared to zen5's 48 KiB L1d/32 KiB L1i and 1 MiB private L2 caches. Hell, the efficiency cores on M5 have more cache available to them than full-fat zen5 cores do at L1 and L2.

To make up for fairly small L1/L2, Zen5 then adds an L3 level as a shared victim cache for all cores in a CCD, with an associated latency hit of doing another cache hit test. Doing so does help mitigate the fact that its memory bus is just bulk DDR5. Apple OTOH then goes straight from L2 to a 128-bit wide LPDDR5X at 9.6 Gbps per pin, which is located on-package so the latency on hitting it is relatively tiny. On consumer this means, even assuming like DDR5-6600 speeds, the zen5 part has to fill its L3 lines with ~105 GB/s to main memory, while the M5's doing it at ~153 GB/s. These numbers only get more absurd if you compare say M5 Pro vs a 9950X, where it's more like 300 GB/s vs that same 105 GB/sec.

One of the biggest performance issues with CPUs is keeping their CPU cores fed with memory. You can get some comically high performance numbers out of zen5 when the instructions fit in its L1 cache, but each layer it has to spill slows things down more and more, especially if it hits memory. Zen5 spills out of those caches much faster than Apple's CPUs do due to the significantly smaller sizes of those closer caches, and then has to fill them back from a smaller pipe with a longer first word latency than the apple part.

Apple could make that CPU run x86, PPC, RISC-V, or MIPS/LoongArch code without too many drastic changes and it'd still be a performance monster within its performance envelope. Very little of its performance is arm9-specific. But it also doesn't scale up super well. There's no way to make a 64-core Apple M5, while AMD's lego-blocks chiplet approach makes that fairly trivial to do (let alone 192-core like the EPYC 9965).

(Side note: I don't have an M5 yet, but I actually suspect there's some degenerate memory load profiles that exist where you'd actually run faster on zen5 out of its L3 cache while M5 has to spill to memory. I am also not sure how bad those would actually be, and for sure they'd be contrived and not represent real world use cases.)

[-]

R-ten-K@reddit

FWIW I think Apple has moved to a compute-chiplet now for some M5 SKUs. So perhaps they "could" do a many-core monster, but I doubt they even want to go into that market at all.

[-]

InfernoBlade@reddit

Yeah I believe you're right. Unfortunately Apple keeps things really close to the chest as to how they've implemented that. We don't know if it's more like infinity fabric (dedicated memory interconnect bus between CCDs), or more like foveros (big, wide interposer on the edge of dies), in terms of how its implemented. The former is definitely easier to scale out than the latter, but the latter is ultimately faster. Given how the M2 Ultra worked, I suspect Apple did something closer to foveros.

Apple historically has avoided anything non-uniform memory wise like the plague, and the IF approach AMD uses sort of inherently causes a NUMA access pattern for all the various L3 caches on each CCD. That also makes me think they just won't scale it up like that, because they'd have to spend a shitload of effort making XNU not fall on its face with scheduling and memory access patterns, as that kernel basically lacks any real NUMA support to speak of.

[-]

R-ten-K@reddit

I suspect it’s probably closer to a Foveros-style approach, though in practice more like SoIC given TSMC. A very wide interconnect to maximize bisection bandwidth. The chiplet split was likely driven more by yield and binning than by any real need for scalability.

Apple’s target is probably a tight pairing between the compute chiplet and the GPU/SoC die, so the priority is maximizing bandwidth between them rather than scaling out to larger multi-die configurations.

FWIW, even if Apple dislikes it, they’ve already had to deal with NUMA-like behavior in the M-series, especially in the Ultra parts. With multiple core clusters, distributed memory controllers, and contention from other high-bandwidth IP blocks, these modern high-performance SoCs end up exhibiting “NUMA-ish” characteristics whether they want to or not.

[-]

Geddagod@reddit

But it also doesn't scale up super well. There's no way to make a 64-core Apple M5, while AMD's lego-blocks chiplet approach makes that fairly trivial to do (let alone 192-core like the EPYC 9965).

What's stopping them?

[-]

onolide@reddit

I'm guessing many-core, monolithic processor would be super costly to make since each processor is huge, so natural defects during manufacturing would lead to more processors being useless(as compared to smaller low-core processors or chiplet-based processors, where each chiplet/processor is smaller and defects more likely leave a usable processor that can be binned instead of trashed).

[-]

1731799517@reddit

Nearly two deacdes of R&D focused on optimizing for battery powered devices as opposed to servers and PCs...

[-]

bazooka_penguin@reddit

Problem is they're outpacing the chips that spent 2 decades of R&D on high performance at peak single threaded performance and in plenty of mixed workloads too.

[-]

1731799517@reddit

Eh, that outpacing happens mostly by apple, and here the "they have more money than all others combined and have full control of the infrastructure" helps with the performance...

[-]

R-ten-K@reddit

There’s a well-established pattern: performance tends to follow economies of scale.

x86 vendors eventually surpassed traditional high-end RISC systems largely because they had the volume and revenue to out-invest them. Before that, high-end RISC machines had already overtaken the minis and mainframes that came before.

Deep pockets attract top talent, and top talent builds better designs, it’s a reinforcing loop.

That trend also tracks with integration. As more functionality moved onto a single chip, the advantage shifted towards those vendors. Today, mobile SoC players, operating at the highest levels of integration, often deliver the best per-core performance.

[-]

trmetroidmaniac@reddit

Lots of things, but high IPC is one of them.

Simplified, power scales linearly with IPC but quadratically with frequency. Apple's ARM CPUs have modest clock speeds but crazy good IPC. x86 CPUs tend to be narrower, instruction decode is a reason why.

[-]

onolide@reddit

power scales linearly with IPC but quadratically with frequency

Hmm I thot processor power draw scales linearly with clock frequency but quadratically with voltage?

[-]

Alarming-Elevator382@reddit

Apple’s CPUs don’t have modest clock speeds anymore, the M5 has a boost clock speed of 4.6GHz.

[-]

phire@reddit

And they suck down massive amounts of power when boosting to 4.6 GHz. Within the same ballpark of an x86 core booting to 4.6 GHz.

When we say "power efficiency" these days, we are mostly talking about battery life tests. And in most of those tests the CPU is basically idle for large chunks of time.

The two most common battery life benchmarks these days are video playback and web browsing.

With video playback, the CPU might only wake up once a frame or so and spends just enough time to send the audio and video off to dedicated hardware decoders and then point the video and audio DACs at the decoded result.
You aren't testing the CPU at that point. It's mostly a media engine test. And a good media engine can probably play a whole video without needing to wake the CPU once.

Or you might actually be benchmarking windows, and how often it wakes up the CPU for pointless unrelated background work.

Web browsing isn't that much better. The test is largely network bound, and while the layout and javascript might run on the CPU, it passes the bulk of the actual rendering work off to the GPU.

RISC vs CISC doesn't matter when the CPU isn't even running most of the time. ARM SoCs tend to do better in battery life tests because they were optimised to run in phones, where battery life is at a premium.

Most x86 CPUs are optimised for servers and desktops. Even the laptop versions of the chips are an afterthought (though Intel did put effort into optimising Luna Lake for laptops, and it actually did better than Apple's contemporary M3 chip)

[-]

Geddagod@reddit

And they suck down massive amounts of power when boosting to 4.6 GHz. Within the same ballpark of an x86 core booting to 4.6 GHz.

If this is true, then the problem becomes immediately apparent.

Every core from ARM or their custom designs all have higher IPC.

When we say "power efficiency" these days, we are mostly talking about battery life tests.

Idk about that. Single thread perf/watt is something that still gets talked about a decent amount, if not by the general public, in more niche forums like this one.

Qualcomm and Intel both also market single thread perf/watt curves.

[-]

Due-Cupcake-255@reddit

nor are intels chip straight up bad when it comes to efficiency. The lowend nucs need less power than a raspberry for example.

[-]

Boeing367-80@reddit

Ok, but IPC is efficiency, so isn't this just saying ARM chips are more efficient bc they're more efficient? I'm not trying to be snarky, I'm genuinely trying to understand.

If I asked Google, it tells me CISC vs RISC, but folks here are saying that's wrong.

[-]

Vivid-Software6136@reddit

ARM is designed from the ground up for lower power applications which is why its more efficient within those power envelopes. x86 was designed for servers and desktops where power efficiency wasn't the main priority. Boiling it down to CISC v RISC is only broadly correct because RISC architectures mostly were built to be efficient from the beginning. In the modern context RISC v CISC isnt a meaningful distinction at a technically level when comparing architectures.

All that goes double for Apple silicon specifically, they squeezed desktop class performance out of chips you can put into a phone because thats all they were doing, they werent trying to design chips to be scalable to server level. If intels primary market was mobile phones or sub 5W devices x86 would have evolved differently.

[-]

Geddagod@reddit

x86 was designed for servers and desktops where power efficiency wasn't the main priority

Server power per core and mobile power per core are similar. If anything, the P-core in smartphone chips actually boost and draw more power per core than flagship core count server chip.

Power efficiency prob isn't a large priority in desktop though, sure.

they squeezed desktop class performance out of chips you can put into a phone because thats all they were doing, they werent trying to design chips to be scalable to server level

What differences in design does this allow to explain the massive gap between Intel and Apple cores?

[-]

SirActionhaHAA@reddit

Server power per core and mobile power per core are similar.

No. Real mobile cores go <1w per core

the P-core in smartphone chips

Which is why they ain't all that efficient. It's usually the e cores that do more to account for the battery life.

[-]

Geddagod@reddit

No. Real mobile cores

What's a "real" mobile core?

go <1w per core, more like 0.6-0.8 actually. Dc cores go 2w

Intel and AMD cores can also go to extremely low power levels. Their Vmin is not so high that their minimum power is 2 watts.

And in server skus you do have cases where some cores will get power gated. Specpower2008 itself is designed to measure this scenario as well.

Which is why they ain't all that efficient.

The P-cores look more efficient once you go above \~0.8 watts vs the E-cores.

It's usually the e cores that do more to account for the battery life.

If LNL taught us anything it's that SOC design/power can contribute just as much if not more to battery life than the cores.

But sure, also, the E-cores can idle lower than the P-cores too.

[-]

SirActionhaHAA@reddit

Intel and AMD cores can also go to extremely low power levels. Their Vmin is not so high that their minimum power is 2 watts.

Key word, can also go, but they ain't designed to run optimally at those power levels. 9745 is 400w tdp 128c, even if ya assume that the io takes 100w the per core power is still 2.34w

[-]

Geddagod@reddit

9745 is 400w tdp 128c, even if ya assume that the io takes 100w the per core power is still 2.34w

The 8 elite gen 5 can consume up to 20 watts in GB6 nT. The 8 elite gen 5 has 8 cores, suggesting at least 2 watts per core, if not slightly more.

The Mediatek 9500 goes up to 18 watts in GB6 nT. The Mediatek 9500 also has 8 cores, suggesting again, 2 watts per core. Ofc the 4 C1 pro cores in this chip don't even look like they can consume 2 watts per core even at Fmax, so that also moves more of the power budget to the M and P cores.

And when you start arguing that these cores can go lower, don't forget that the apple e cores can go lower too.

Why is the comparison here Apple's E-cores?

Is that what you mean by "real" mobile cores?

Don't I specify "P-cores" in the second line of my original comment too?

Idk what's even to doubt here, amd has always said that their uarch and physical implementations are dc 1st targeting higher power levels and are reused in other markets.

Idk why you keep insisting DC is high power levels. DC is the lowest voltages on AMD's graphs along with APUs.

DC power per core is low.

Are we doubting their own claims?

AMD has never said that their cores in server skus consumes more power per core than the P cores or even M cores in phone skus lol.

A different core designed for different purpose, nothin wrong with that

There's plenty wrong with AMD cores performing worse or the same as ARM cores at significantly more power.

There's also plenty wrong with AMD cores drawing the same amount of power as the ARM cores and then performing significantly worse.

[-]

R-ten-K@reddit

ARM is designed from the ground up for lower power applications which is why its more efficient within those power envelopes.

It really is not. ARM was originally designed for desktop applications. That it found itself being used in low power/embedded designs is more of a historical accident (esp around its licensing model) than anything else really.

The "legacy" debt on a modern x86 is rounding error at this point (<2% of the total power/area budgets)

[-]

trmetroidmaniac@reddit

ISA is definitely a reason why ARM CPUs can be more efficient.

However, that ISA is not RISC. RISC is a way of designing an ISA based on what was microarchitecturally possible in the 80s and 90s. The reasons for doing things that way no longer exist. There's more to RISC design than simply "simple instructions" or "complex instructions".

Modern ARM emerged out of RISC and a few of RISC's good ideas remain. This is why I described it as "post-RISC".

[-]

theQuandary@reddit

What makes an instruction simple or complex?

Based on that metric, are ARM instructions actually complex?

[-]

trmetroidmaniac@reddit

Originally, these are some of the things that distinguish a RISC ISA from a CISC one.

A RISC instruction should take one cycle to execute.
A RISC instruction should be comparable to a micro-instruction on a CISC machine.
RISC instructions should be fixed length. Decoding RISC instructions should be simple.
Only load/store instructions should access memory, with few addressing modes. General instructions should not access memory. One instruction should perform one memory operation.
Use a large, uniform register file. Instructions can use any register.
Less focus is placed on convenience for handwritten assembly - rather, it is intended as a compile target for high level languages.
Certain microarchitectural details can be exposed if it makes it easier to achieve these goals. For example, delay slots on jumps and loads simplify the pipeline and make it more efficient.

Overall this makes the instruction set simpler. However, the last point certainly makes it more complex from a programmer's point of view. The purpose is to optimise the microarchitecture, not to just be simple.

32-bit ARM is mostly RISC:

Typically one instruction per cycle
Fixed length instructions with simple decoding
Only load/store instructions access memory
Pretty large register file. The program counter, link register and stack pointer are general registers.
Program counter reads are offset, exposing pipeline details.

But also has some CISCy features:

Almost all instructions can be made conditional with predication.
The barrel shifter is available on many instructions.
There are powerful addressing modes for load/store.
Load/store can perform multiple memory operations.
Thumb extensions add variable length instructions.

64 bit ARM has evolved considerably beyond this. Some of these changes are more RISC:

No more instruction predication
No more load/store multiple, only load/store pair
No more Thumb, instructions are fixed length again

Some of them are less RISC:

More instructions with variable latency. Many instructions will not complete in 1 cycle.
PC is no longer a general register. SP access is limited. Pipeline details are hidden.
Powerful branchless conditional instructions are available.

These changes were made not because of the original RISC design goals (get to 1 IPC on a scalar processor) but because they enable modern microarchitectures (maximise IPC on superscalar, OoO processors).

[-]

theQuandary@reddit

I agree with almost everything you wrote. I think though that the best view of what RISC creators intended is best determined from the CPUs they helped create. For this purpose, MIPS and SPARC (Berkeley RISC) should set the standard for what RISC means.

Use a large, uniform register file. Instructions can use any register.

I don't know of any RISC

MIPS has special registers HI/LO and PC. SPARC has Y, PSR, TBR, WIM, and ASR. You might also consider the zero register to be special (from an implementation perspective if not from an instruction perspective). RISC-V has fewer special registers in the base ISA, but the CSR spec allows for over 4k special registers.

A RISC instruction should take one cycle to execute.

Even early versions of MIPS used a separate pipeline for these instructions. If you look at M-series chips, you still see this. They always have a couple of ports which only take simple instructions and rely on other ports and OoO scheduling to work longer instructions around them. Likewise, the FPU always gets its own scheduler and basically functions as a co-processor just like original RISC did with unavoidable multi-cycle instructions.

Powerful branchless conditional instructions are available.

I'd note that branchless/conditional instructions actually reduce pipeline bubbles and don't take more instructions to execute. I think ARM got this right and that's proven by looking at the alternative in RISC-V (Zicond) which uses a second instruction and forces instruction fusion. This feels the opposite of what RISC intended.

PC is no longer a general register. SP access is limited. Pipeline details are hidden.

MIPS PC was only indirectly accessible from the start, so I'd say that definitely qualifies as RISC.

Stack Pointer is more interesting.

All the most common RISC have a register hardwired to zero. ARM makes this register dual special-purpose. Most RISC use a GPR, but SPARC (one of the 2 original RISC designs) only kinda makes it a GPR due to how register windows work.

Meanwhile, x86 (and I believe some other CISC) made the stack pointer into a GPR that had other special functions.

ARM's approach seems not quite either style, but probably more like what CISC systems were doing.

On the whole though, the differences between RISC inspired designs and CISC designs is very large even if we look at the latest generation (RISC-V and ARM aarch64) with a minimum of CISCy features and almost all RISCy features being present.

[-]

phire@reddit

I'd note that branchless/conditional instructions actually reduce pipeline bubbles and don't take more instructions to execute. I think ARM got this right

Conditional execution is a decent idea for in-order pipelines.

But the advantages don't carry over to modern in-order cores, conditional execution can actually reduce performance. There is a very reason why it was one of the many design features that weren't carried over to Arm64.

And not just because of the wasted decoding space, which massively hurt 32-bit ARM, requiring the creation of Thumb. Thumb didn't have conditional execution at all (other than branch instructions). Thumb 2 added the If-Then prefix instructions which is probably the better approach (especially for pipelines like Cortex-M33 which can dual-issue the If-Then prefix).

Conditional Execution is a bad idea for OoO pipelines simply because regular branch prediction can massively outperform it.
A conditionally executed instruction always takes 1 cycle to execute. And it always executes, even if the result is conditionally discarded. But a correctly predicted branch takes 0 cycles.

And modern branch predictors can be extremely accurate. With a 10 cycle penalty for miss predictions, the predictor would have to be wrong more than 10% of the time (and modern branch predictors often hit well over 99%) before the average cycle count of the branch goes above 1.0 cycles.

And it gets even better, using branches breaks dependency chains. Which can lead to the later instructions executing in parallel with (or even before) the instructions calculating the condition. The effect cycle count of the branch version can actually go negative compared to the conditional execution version.

And something that people always miss. OoO CPUs will often take that 10 cycle pipeline bubble out of order. They often end up completely hidden behind expensive cache misses (as long as they aren't dependant on the result) and a single cache miss could end up hiding multiple branch miss-predict bubbles.

[-]

DerpSenpai@reddit

Yes, ARM and x86 Instructions nowadays are a mixture of CISC and RISC. ARM however sheds instructions. ARMv9 is only compatible with 64 bit code. It means, the OLDEST code ARM can run nowadays was made in 2012. ARM constantly revises their ISA to make sure it doesn't hinder their CPU design for the future.

[-]

theQuandary@reddit

x86 core instructions (those 12 instructions that comprise 89% of code) are all extremely CISCy.

I'd agree that ARMv7 and v8 aarch32 are very CISCy, but those features were dropped when ARM/Apple started over from scratch.

If we have a scale from 1 to 10 where 1 is VAX and 10 is MIPS or SPARC (the two original RISC), where would the two most modern popular ISAs fall (ARM aarch64 and RISC-V)? Where would x86 fall?

[-]

R-ten-K@reddit

ISA is definitely a reason why ARM CPUs can be more efficient.

It actually is not.

Contrary to reddits misconception, RISC-y designs were not particularly good in terms of power efficiency, since they required much more instruction fetch pressure, for example.

The reason why ARM designs have been traditionally more "efficient" than x86 has more to do with the vendors that were implementing them. Players like QCOM and APPL had tremendously efficient cell designs for low power applications, for example. They could have used x86 front ends if they had the license, and they would have come up with better low power cores that either intel or AMD, simply because of the accrued expertise in that field of their design teams.

ISA an uArch have been decoupled for basically 3 decades now.

The last example of an ISA that had intrinsically poor power performance was Itanium, mostly for its requirement of predication.

But given the same cell libraries and similar functional back end and process node, an x86 and ARM designs tend to be fairly close in terms of their power/performance envelopes.

[-]

trmetroidmaniac@reddit

Backend is decoupled from ISA these days, frontend is not. Wide decode is one reason why Apple M series is as efficient as it is, but this is infeasible on x86.

[-]

R-ten-K@reddit

Absolutely not. Decode-width is not dependent on ISA, the decoders are orthogonal.

[-]

Due-Cupcake-255@reddit

IPC is a meaningless number when used across different architectures. It would be very simple to have an extremely high IPC with terrible real world performance.

[-]

Geddagod@reddit

Ok, but IPC is efficiency,

It's not. You can have a core that has higher IPC, but is less efficient at certain frequencies or power levels, or even across the entire power curve.

[-]

SoilMassive6850@reddit

How about the fact that companies designing ARM chips have traditionally focused on market segments that put a higher value on efficiency than the companies primarily making desktop and server chips.

Expecting the difference to come from the instruction set rather than the design of the entire chip is quite naive.

[-]

R-ten-K@reddit

This is actually the correct answer.

ARM ISA is not inherently more "power efficient." It is the vendors implementing the ARM cores, which have traditionally focused on mobile/embedded designs that created the effcient cores.

If QCOM et al had an x86 license, they would have created equally efficient cores.

[-]

onolide@reddit

ARM ISA is not inherently more "power efficient."

Yeah, a concrete example would be AmpereOne processors, which are ARM but have horrible idle power consumption like desktop x86 chips do(not sure the exact levels tho) cos those AmpereOne chips were designed for data center use, not mobile/embedded designs.

On the other end of the spectrum, Apple M Ultra processors are desktop class processors(at least in the Mac Pro), but have very low idle power consumption compared to x86 desktop processors.

[-]

theQuandary@reddit

Servers and laptops don't care about power efficiency?

[-]

zopiac@reddit

They care about raw performance and software compatibility more. Now some servers are migrating towards ARM since there are performant-enough processors on that arch, and laptops (are starting to) have decent enough translation layers (or ARM-native applications) to switch over.

[-]

Geddagod@reddit

They care about raw performance and software compatibility more.

If they cared about raw performance more, then you would see each core in server skus boosting way higher. Just look at how low server boost clocks are vs DT or mobile boost clocks.

The key is performance per watt.

[-]

zacker150@reddit

The key is performance per rack unit.

[-]

theQuandary@reddit

They care about raw performance and software compatibility more.

SPARC, IBM System/360, POWER, ARM, x86, MIPS, EPIC, Alpha, etc

Software compatibility is a lower consideration on servers than anywhere else which is shown by the proliferation of different ISAs still in use on servers (I skipped ones like PA-RISC or VAX that aren't used anymore).

Servers control their workloads and almost always have the source code. Linus (of Linux fame) stated that the reason x86 took over was because you could run/debug the software on your local machine easily and predicted that ARM would become much more popular in servers if ARM laptops were available. I believe he was correct.

x86 is struggling to keep up with Neoverse V3 on PPA even though it's based on X4 which is two (soon to be three) generations behind their best cores.

[-]

zopiac@reddit

I should have added "respectively". Of course servers can run whatever code they want to spin up, and I was focusing more on your second question regarding laptops by mentioning compatibility.

[-]

theQuandary@reddit

Saying that compatibility (x86) only happens at the expense of efficiency is quite a controversial statement on this subreddit.

[-]

zopiac@reddit

Interesting how I never said that. Not sure what your angle is, but I'm tired of my words being bent over it.

[-]

DerpSenpai@reddit

Servers are going ARM because there's nothing cloud native that is x86 only. Cloud is a relatively recent concept so everything has ARM and x86 builds. On Prem will continue to be way more x86

[-]

OkDimension8720@reddit

They all care about efficiency within their specific limits, but ARM based handheld devices have much smaller limits due to thermal constraints and power draw. When you work with a 250w server budget vs 8w handheld budget, the focus shifts a lot.

[-]

theQuandary@reddit

That 8w handheld is achieving better single-core performance than that 250w server chip. We're at a point where Apple's sub-1w E-cores achieve almost the same performance per core as an x86 chip with all cores under sustained load (not able to turbo).

Amazon has invested many billions into Graviton because it has a massive PPA advantage allowing them to pass some of those savings on to customers and undercut the competition.

[-]

SirActionhaHAA@reddit

Nothing much because arm ain't all that more efficient than x86. The efficiency of chips are not really decided by the isa, it's by the power target, node, and design quality

Apple chips are efficient because they have all 3 of those

Apple designs its chips for consumer workloads. This means disregarding extreme or high nt perf because their cores ain't needed in the dc space, they don't scale to 200cores. Apple cores target high ipc and st perf because that's what matters for mobile and thin premium devices
Apple always uses the latest and most efficient nodes because its products are premium and vertically integrated and command greater margins than most windows pcs
Apple designs good cores
Windows devices are traditionally 25+w, oems don't want 10w chips and so neither intel nor amd designs for those power levels. They can, but they are not because no oem customers wanted them

It's really not up to arm or x86. The isa accounts for probably just 2-3% difference in power which equals just a few minutes of battery life.

[-]

Geddagod@reddit

Nothing much because arm ain't all that more efficient than x86.

ARM cores are, though the ISA itself is not the reason why, sure.

They are, significantly, especially in 1T.

This means disregarding extreme or high nt perf because their cores ain't needed in the dc space, they don't scale to 200cores

Though there's nothing intrinsically keeping SMT away from ARM cores either (see Nvidia's new chip, and some auto ARM cores IIRC).

And SMT's cost for power and ST perf are very minor.

Apple always uses the latest and most efficient nodes because its products are premium and vertically integrated and command greater margins than most windows pcs

AMD's and Intel's cores don't look on par or better in 1T perf/watt iso node either. Which is even sadder considering they should have more modern designs iso node, as a result of moving to newer nodes slower.

Windows devices are traditionally 25+w, oems don't want 10w chips and so neither intel nor amd design for those power levels.

Problem is that ARM (apple, Qcomm) are designing up to those power levels, and they still look better than x86 in those power targets, in package/board power, per core.

[-]

golkeg@reddit

Ok, so what accounts for the greater power efficiency of ARM chips?

Lower operating voltage
Shorter pipeline

[-]

f3n2x@reddit

A more favorable voltage curve on an architecture designed to take maximum advantage of it, mostly. If you tune down Zen 4 X3D to like 30W, for example, the actual CPU cores (excluding IOD) are not that much more inefficient than Apple silicon.

[-]

Geddagod@reddit

If you tune down Zen 4 X3D to like 30W, for example, the actual CPU cores (excluding IOD) are not that much more inefficient than Apple silicon.

They are, whether you look at software power reporting, or board power (though the comparison there would be AMD mobile chips, not X3D stuff).

[-]

f3n2x@reddit

That's a cool story but I wasn't talking about non-X3D mobile parts. Zen 5 (the 4 was a typo) gets around 65% more power efficient in compute heavy tasks compared to stock when in the ~4GHz/0.85V range. And that's with the desktop-optimized IOD on an older node plus IF. The cores absolutely can throw punches with Apple silicon in the right environment.

[-]

monocasa@reddit

FJCVTZS is an example of an Aarch64 instruction that would have never been found in a classic RISC CPU, but is awesome for modern microarchitectures because it gets rid of costly branches and data dependencies.

People harp about this instruction because of the "Javascript" in the name, but it's a very RISC instruction in philosophy. The operation simply converts a double to a signed integer, but uses round to zero mode instead of the rounding mode in the FPU flags. The whole reason for that is because the default rounding modes in the ARM ABI are different than the x86 ABI, and for historical reasons, JavaScript specifies the x86 default of round to zero. This instruction allows ARM code to perform this operation without having to change rounding modes back and forth.

The impetus, design, and implementation of this instruction fulfills the core tenets of RISC. Half the idea was to look at your use cases and come up with simple instructions more or less on the order of what would have been microcode ops for a CISC machine for those instructions, and keep the other ideals around like 3 addr ops, alu ops are reg-reg and separate from ld/st ops, etc.

For instance, today we'd consider inclusion of BCD math as an example of the complexity of CISC. But back in the day, early RISC practitioners met up, discussed HP's inclusion of BCD ops in what would become PA-RISC, and agreed that in that context it was a very RISC implementation. They decided to implement those instructions because they saw a need for them in instruction traces of programs they wanted to run (think COBOL and other 80s and prior 'business' software). They implemented them in simple instructions that would be on the order of a typical single microcode op. They were three address, reg-reg ops.

https://www.yarchive.net/comp/bcd_instructions.html

[-]

phire@reddit

RISC wasn't really about making the instructions as simple as possible. It was about making the instruction decoder as simple as possible.

Contemporary CISC designs were dedicating massive amounts of die space to instruction decoding. If you take a look at an annotated 8086 die or 68000 die, about 2/3rds of the die is taken up by microcode or other decode PLAs, and the various bits of control logic. The actual register files and execution units are in a datapath which only take up a reasonably small slice of the die (along the left edge of the 8086 or the bottom edge of the 68000)

So if you actually focused on making the simplest viable control logic, you would have way more space on your die for other things. Like fast caches (or at least cache control logic), TLBs, more registers, bigger registers/ALUs or even a full barrel shifter.

Also, all that complex control logic, tricky PLAs and microcode ROM took a huge amount of engineering effort. Motorola and Intel had huge teams designing those CPUs, I wouldn't be surprised if instruction decoding and associated control logic took up way more than 2/3rds of the total engineering effort.
By simplify the control logic as much as possible, it suddenly became very viable for small engineering teams to make very competitive CPU designs.

Just take a look at the ARM1 die. Made by just 10 people, the datapath with registers, ALU and Barrel shifter takes up well over half the die.

The great thing about datapath is that you only need to design a one-bit wide slice, and then duplicate it 32 times.

TBH, I'm a little surprised how much die space ARM dedicated to the LDM/STM instructions. You could argue those instructions "aren't very RISC", because they take multiple cycles and require a very complex pop count and priority encoder. And it was one of the many design choices that didn't get carried over to Arm64. But ARM code did really benefit from them.

[-]

waitmarks@reddit

It's like my grandpappy used to always tell me, "You either die RISC or live long enough to see yourself become CISC"

[-]

jenny_905@reddit

It's complicated.

I remember this discussion with the AMD K5! Probably giving my age away there but that was the gist of it, more or less.

As far as I know the techniques it employed were replicated on most CPUs going forward.

[-]

phire@reddit

For the AMD K5 and Intel P6 (Pentium Pro, Pentium II, Pentium III) the "CISC converted to RISC" was kind of true.

AMDs marketing/documentation for the K5 explicitly said so, and both designs often acted in that way, a typical x86 read-modify-write instruction (like add [ebp+eax*4], ecx) could be split into three or four seperate RISC-like μops (Address calculation, read, modify, write). Especially the K5, where even add ecx, [ebp+eax*4] would be split into two μops.

But that approach wasted a lot of uops. It's just wasteful to completely split up every single x86 instruction into individual RISC-like μops.

AMD's K6 (which shares nothing with K5. AMD Acquired NexGen and rebranded their in-progress Nx686 design as K6) switched from "micro ops" to "macro ops", each of which was powerful enough to represent a full read-modify-write instruction and they went back to an execution pipeline which could do a memory op at both the top and bottom.
Some complex instructions would still need to be broken into multiple macro ops, but a massive subset of typical x86 code could be represented with just a single macro op. This design carries forwards to modern Zen processors too.

These macro ops are not very RISC like. They can each do a full x86 address calculation, a read, an ALU op and a write. And for modern Zen processors, they are actually documented as being variable length (the number of ops per cacheline in the op cache varies).

Intel went down a similar path. Even the original pentium pro supported doing address calculation, read and an ALU op in a single μop, it was only the read-modifiy-write instructions that needed four μops.
And for later designs (like the Pentium M) they introduced "fused μops", which is more or less the same thing as AMD's macro ops, not very RISC-like. When not being executed, they are stored internally in as a single fused μop. They only turn into multiple μops when dispatched to the execution units.

Essentially, while they might split complex operations up into smaller ops, those ops are not RISC. They are very much specialised for executing the more CISC like x86 code.

[-]

R-ten-K@reddit

FWIW, modern high-performance RISC designs also break down "public" instructions into simpler/specialized "internal" uOps.

The whole "RISC-like inside" was nonsense thrown by people on the internet on the usual game of telephone that are internet flamefests.

[-]

phire@reddit

The whole "RISC-like inside" came from the usual game of telephone that are internet flamefests.

The AMD K5 really didn't help the situation. The manual explicitly names the μcode "Risc86" and I really get the impression the AMD engineers were thinking along those lines too.
But it's not like the K5 was a good design.

On the other hand, the Intel P6 engineers were not thinking on those lines. Or at least Bob Colwell's Oral history (a senior architect for the P6) makes it very clear he saw the P6 as "revenge of the CISC".

And while there was a bit of a game of telephone, it's not really fair to blame it on the internet. It was a common description in print media and you even see it in academic work of the era.

Use of such language also started long before the P6, you see the 486 and P5 described as "risc-like", mostly because with their pipelining they started to match RISC CPUs. And optimisations guides talking about the "RISC-like subset of x86" which preforms quite a bit better on those CPUs.

The irony is that technique is more CISC-like than anything

I don't like calling it more CISC-like. The fact that we call the ops microops and most OoO designs include some kind of microcode ROM, it's not really the same kind of microcode that we saw in earlier CISC designs.

It's certainly not horizontal microcode. And while it might look somewhat like vertical microcode, it's very different IMO.

A large a part of the reason we keep seeing this RISC vs CISC debate showing up, is that we kind of stopped naming CPU architectures after RISC (other than VLIW). Many people try to frame it as a continuous CISC to RISC spectrum, and then place all modern OoO somewhere in the middle.

But really, OoO cores^[1] are their own thing and that thing isn't RISC or CISC. Their optimisation incentives are completely different. One of the great things about the OoO paradigm is that somewhat isolates the μarch design from the ISA it's implementing, which is why x86 originally switched to it; And probably why it never got a name, since (unlike RISC) it didn't require massive rethinks on how to design ISAs.

^[1] Or more modern OoO cores with their massive execution windows and full schedulers. Designs with some out-of-order like the early PowerPC designs lack schedulers and operate over such small windows that I'm not sure we should consider them to be OoO at all. At least not under the modern definition.

[-]

lally@reddit

The painful part for x86 is decoding. You generally want to decode a bunch of instructions at a time to get ahead of latencies. But x86 is variable length, so the second decoder has to mostly decode the first instruction again. The third decoder has to do both of the first two. Etc etc. Arm are fixed width, easier to do on risc. So you can decode as aggressively as you want.

[-]

R-ten-K@reddit

No. That is not how that works at all.

Parallel decoders on x86 are orthogonal. Just like on ARM or POWER.

Furthermore, just like on modern RISC machines, x86 cores are fully decoupled. Meaning the Fetch Engine is separated from the Execution Box. All the decoding "latencies" are abstracted out as far as the Functional Units are concerned.

[-]

lally@reddit

How does the second decoder know where to start decoding?

[-]

R-ten-K@reddit

the instruction buffer/queue takes care of that. the decoders automatically get their pointers from the RIP unit in the front of the fetch engine.

[-]

lally@reddit

So here's what happens. Your frontend for a single core wants to decode a few instructions in the next cycle. It's fetching bytes from the icache. They'll go into the reorder buffer for dependency analysis/speculative execution/parallel execution.

Here, have a look https://www.agner.org/optimize/microarchitecture.pdf in the descriptions of the various architectures' pipelines. E.g Zen 1-2 takes 16 bytes off of icache in a cycle and tries to decode up to 4 instructions from it.

In a single cycle, your first decoder decodes the instruction at offset 0 relative to where the last instructions were decoded in the prior cycle. But the offset for the second decoder has to do most of that work again, to get its offset. The third decoder has to figure out the first two instruction lengths to figure out where to start decoding, etc.

[-]

R-ten-K@reddit

I have no idea who this agner is

In any case, the front end in modern x86 cores augment instruction streams with preprocessed and cached metadata so the decoders only have to discover boundaries rarely,

the cache controller will store extra bits alongside cache lines as markers for instruction start, length, branch boundaries, etc. So the local instruction buffer doesn't have to parse everything.

similarly the predictor block will store the properly offset program counters in the RIP buffer.

instructions decoded are placed on the uOp cache, so no boundary checks are needed there as well. Same for the loop stream detector and the trace cache (if implemented).

for the rare occasions when the boundary check fails the decoder simply flushes its state and resends request to the prefetcher. But it is dealt with no different that an predictor/prefetch/cache/etc miss.

[-]

Wizard8086@reddit

Other have responded, but I'd like to add on why RISC is not necessairly better:y

What an unnecessary abstraction layer is can vary. RISC is still abstracted in the sense that it forces a set of basis instructions, but the guts of the cpu do not directly take risc instrictions. No one writes assembly, so we should just trust the compiler, and make it speak directly to the hardware, right? This is called VLIW (very long instruction word) and Intel tried this, two times (i860 and Itanium). Both failed spectacularly.

What is missing here is that you cannot just see a modern (read as in post late 90s) cpu as a simple calculator. CPUs are computers that analyze a problem in the form of instructions, find the most optimal way to solve it and parallelize it, using internal execution units. They even try to predict stuff, and they're good at it.

As such, what really matters is how good an ISA is for this. And we found that at the moment there really aren't big differences. x86 is harder in theory, mainly because of variable lenght instructions. But this has been proven to not be a problem.

Given this, in the future, code density might be the way to go, because bandwidth is expensive energy wise, and as such extremely important. RISC is bad at that.

[-]

m0rogfar@reddit

This... makes no sense.

What an unnecessary abstraction layer is can vary. RISC is still abstracted in the sense that it forces a set of basis instructions, but the guts of the cpu do not directly take risc instrictions. No one writes assembly, so we should just trust the compiler, and make it speak directly to the hardware, right? This is called VLIW (very long instruction word) and Intel tried this, two times (i860 and Itanium). Both failed spectacularly.

One, that's not at all what VLIW is. VLIW refers to bundling the instruction issue for all superscalar units into a single super-instruction, rather than making individual instructions for one scalar unit, and having the processor send them out to the execution units.

Two, i860 is not VLIW at all (it's a much more classic RISC design), and calling it a failure is also... questionable. Unlike Itanium, it was a complete success from an engineering point of view, being much faster than Intel's own x86 processors at the time. It failed only for business reasons - most major RISC buyers were already locked in (Sun had vertically integrated with SPARC, IBM had vertically integrated with POWER, DEC was late in development on Alpha, SGI had close partnership with MIPS that was already approaching acquisition), so there was no long-term anchor customer.

What is missing here is that you cannot just see a modern (read as in post late 90s) cpu as a simple calculator. CPUs are computers that analyze a problem in the form of instructions, find the most optimal way to solve it and parallelize it, using internal execution units. They even try to predict stuff, and they're good at it.

This is also nonsense.

The entire takeaway from VLIW was that more complex instructions -> less flexibility for OoO and branch prediction -> less ability find more optimal ways to execute code, while simple instructions -> more flexibility for OoO and branch prediction -> more ability to find more optimal ways to execute code.

If you want maximum throughput for perfectly ordered instructions, you need complex instructions and therefore VLIW. This is why GPUs are largely just VLIW-evolutions. If you want ability to analyze the instruction flow on the processor and solve it optimally and intelligently, you need simple instructions and therefore RISC. This is why all non-VLIW processors either use RISC directly or use RISC with a CISC emulation compatibility layer since the Pentium Pro.

As such, what really matters is how good an ISA is for this. And we found that at the moment there really aren't big differences. x86 is harder in theory, mainly because of variable lenght instructions. But this has been proven to not be a problem.

Saying it has been proven is a bit of a stretch I think. We still haven't had a x86 processor that had a performance lead that can't be attributed to a massive node advantage in the 38 years since SPARC was launched into the market with the 1987 Sun 4. Any time we've had RISC processors targeting performance leadership being made on even an n-1 node, they've held the lead.

[-]

R-ten-K@reddit

the i860 did operate in a "VLIW" mode where the compiler would flag a 2 int+fp instruction bundle that could be executed in parallel.

as a scalar application processor, the i860 was sort of a failure. It had very poor general performance for user code, since had atrocious pipeline stalls and servicing overhead for memory misses and interrupts. It was used mostly as a graphics/media accelerator operating, ironically, in its VLIW mode.

ISA and uArch have been decoupled since the 90s. Given a comparable node and functional resources, a CISC or RISC cores end up on very similar performance ballparks.

[-]

Wizard8086@reddit

"no sense" is a stretch. My comment is a simplified explanation to a person who doesn't know what a microcoded processor is, not a thesis in ISA design. Of course there are going to be asterisks.

About the VLIW issue, while it's true that's not VLIW is (my bad), that's what VLIW does in layman terms. As you can see, I also tried to convey that performance comes from parallelization and OOO, not intrinsically from the ISA implementation (not that much). The other comment also made a point about ia64 being aware of this.

I also think you're thinking about the i960 and not i860.

> Saying it has been proven [...]

What I meant there is that while people always talk about variable length instructions (more than CISC in general) being an issue for frontends, that's hasn't been an unsolvable problem so far. Of course now ARM seems to be taking the lead, but calling ARMv9 a RISC ISA is very debatable.

[-]

theQuandary@reddit

Unlike Itanium, it was a complete success from an engineering point of view, being much faster than Intel's own x86 processors at the time.

Are you thinking of the RISC i960 design?

i860 was a dog-slow VLIW design with some latencies running in the thousands of cycles. It saw only limited use in a few DSP jobs as I recall. It should have been the clear indication to Intel that EPIC was going to have some of the same issues.

The entire takeaway from VLIW was that more complex instructions -> less flexibility for OoO and branch prediction -> less ability find more optimal ways to execute code, while simple instructions -> more flexibility for OoO and branch prediction -> more ability to find more optimal ways to execute code.

This wasn't the findings of EPIC at all. The issue with EPIC was the "explicitly parallel" part. Compilers still suck at finding that parallelism 30 years later and likely always will.

What early Itanium lacked was the ability to analyze and adjust instruction flows in realtime -- a kind of hardware JIT. Later Itanium did exactly this by breaking down the instructions and executing the bundle one-at-a-time and the performance improved significantly.

There could still be advantages to an EPIC-style architecture if it's pushed through a traditional pipeline because you can give the CPU hints about parallelism before it starts its work and potentially make its job easier. I think the key here would be avoiding locking the number of instructions per bundle and shrinking bundle size from 128-bit to 64-bit to preserve code density.

This is why GPUs are largely just VLIW-evolutions.

If you look at GCN, it looks a lot more like a RISC with lots of SIMD instructions than a VLIW. Only with RDNA3 did it add in a couple of VLIW-2 instructions (and it sucked because it meant the second SIMD wasn't getting used properly). I think the future is more like Larabee without x86. Something like what TensTorrent is doing for AI with a tiny in-order RISC-V CPU paired with a comparatively massive 128-bit SIMD to do everything else.

We still haven't had a x86 processor that had a performance lead that can't be attributed to a massive node advantage in the 38 years since SPARC was launched into the market with the 1987 Sun 4.

100%

There's a reason why nobody has even considered an ISA anything like x86 for 40 years. Everyone can agree that there are features that make high-performance very hard to achieve (eg, branch delay slots), but when you mention that x86 suffers from these same kinds of performance roadblocks, its suddenly "different".

[-]

jaaval@reddit

For the main question: It is true but it has always been true. Or at least since memory speed had to be decoupled from cpu speed (long long time ago). RISC and CISC was never about what the CPUs do internally.

Forget about those terms, they mean pretty much nothing now. ARM has very complex instructions.

[-]

theQuandary@reddit

What do you consider to be complex instructions?

[-]

jaaval@reddit

Original definition I think was something like instructions that do more than a single micro operation and thus needed a micro op rom and sequencer. Lack of those was the big thing why risc was pushed early on.

Today the definition is meaningless. Even integer division will have to do a lot of operations and high performance requires fairly complex stuff for specialized math.

[-]

theQuandary@reddit

If RISC doesn't exist, then why does x86 still claim to be RISC under the hood?

I may be incorrect, but I think the best interpretation (based on the cores they designed) was "don't bubble the pipeline" which basically means keep it to 1-cycle or move it to a co-processor (this is what early MIPS did with stuff like multiply and later for floating point).

When you look at what Jim Keller calls the 8-10 most important instructions (just 12 instructions make up 89% of x86 code for example), you'll see that they all execute in a single cycle on ARM while most require multiple cycles on x86 (depending on the variant).

If you look at a modern chip, you'll also see this in play. There are always a couple of execution ports with only basic ALU functions. This is so they can always preserve that 1 cycle per instruction flow without bubbles. Next to them you get more complex ports with stuff like multiply or divide that take variable amounts of time (with reordering trying to force them through so they retire at the same time). FPU is still generally its own integrated co-processor with its own scheduler.

x86 MOV is so complex that it's actually Turing complete (meaning you could technically get rid of every other instruction and still have a functioning CPU). You won't find anything like it on ARM (nor any of the other instructions I mentioned).

[-]

jaaval@reddit

Who claims it’s “risc under the hood”?

No modern architecture executes instructions in one cycle. I think typical length of the instruction pipeline is 10-20 cycles. And as I said, there never was a difference in what is done at the actual operation execution stage. All CPUs just feed data from registers through some bus to some execution unit and store the result to some register. Operation in x86 doesn’t take any longer to execute than in ARM.

MOV is complex as an instruction definition. All the different versions of it are different instructions from CPU perspective.

[-]

theQuandary@reddit

You fundamentally misunderstand what is being discussed. Even the very first MIPS core used the classic 5 stages -- fetch, decode, execute, memory, writeback.

The instruction cycle count is considered how many cycles for the execute stage. Read the technical docs from AMD/Intel/ARM or Agner Fog and this is what they will all reference.

This number is important because of pipelining. Look at this example image.

https://en.algorithmica.org/hpc/pipelining/img/pipeline.png

Imagine it as a conveyor belt where people are building a car and each stage takes 1 minute to complete. Now you add in a stage in the middle where it sometimes takes 1 minute, but other times it takes 5 minutes or even 50 minutes to complete. When this happens, the entire conveyor stops while that stage finishes up its work and blocks everything behind it.

In a CPU, fetch, decode, memory, and writeback all generally take a fixed amount of time/cycles.

Most basic instructions like add are very fast (always 1 cycle).

In contrast x86 has complex stuff like FBSTP — Store BCD Integer and Pop which does this:

Converts the value in the ST(0) register to an 18-digit packed BCD integer, stores the result in the destination operand, and pops the register stack. If the source value is a non-integral value, it is rounded to an integer value, according to rounding mode specified by the RC field of the FPU control word. To pop the register stack, the processor marks the ST(0) register as empty and increments the stack pointer (TOP) by 1.

The destination operand specifies the address where the first byte destination value is to be stored. The BCD value (including its sign bit) requires 10 bytes of space in memory.

The following table shows the results obtained when storing various classes of numbers in packed BCD format.

ST(0)	DEST
− ∞ or Value Too Large for DEST Format	*
F≤−1	−D
−1 < F < -0	**
−0	−0
+0	+0
+ 0 < F < +1	**
F ≥ +1	+D
+ ∞ or Value Too Large for DEST Format	*
NaN	*

F Meansfinitefloating-pointvalue.

D Means packed-BCD number.

* Indicatesfloating-pointinvalid-operation(#IA)exception.

** ±0 or ±1, depending on the rounding mode.

If the converted value is too large for the destination format, or if the source operand is an ∞, SNaN, QNAN, or is in an unsupported format, an invalid-arithmetic-operand condition is signaled. If the invalid-operation exception is not masked, an invalid-arithmetic-operand exception (#IA) is generated and no value is stored in the destination operand. If the invalid-operation exception is masked, the packed BCD indefinite value is stored in memory.

This instruction’s operation is the same in non-64-bit modes and 64-bit mode.

Here's a video on some of the "fun" of x86 MOV in its various arcane incarnations.

Can you see why this would take FOREVER to execute? It also uses a ton of transistors and the propagation delay lowers maximum clockspeed dramatically.

RISC made the observation that eliminating these meant you could make a smaller chip that ran faster.

You can try to solve the clockspeed issue by adding more pipeline stages (this is what Intel went crazy with in Pentium 4 and the cancelled Tejas/Jayhawk successors with up to 50 stages).

x86 actively tries to discourage use of all the worst CISC instructions because there is simply no way in the world to make them fast. Microcode is literally the process of encountering these instructions then running a small piece of software to generate a bunch of instructions which then get executed instead (if this sounds slow, it is).

Who claims it’s “risc under the hood”?

Google "is x86 risc under the hood" and look at the many, many results using that exact phrase. You can even search just Reddit and get tons of threads asserting this. Intel pushed this idea hard in their P6 technical marketing back in the mid 90s too.

[-]

jaaval@reddit

I am not misunderstanding anything. I already explained that there is no difference in execute stage in modern CPUs and no RISC architecture pretty much ever has done execute in one cycle, except for the operations that any system does in one cycle.

[-]

R-ten-K@reddit

Don't bother, any discussion w that guy almost always ends up in a bizarre emotionally charged word salad.

[-]

titanking4@reddit

RISC and CISC biggest differences right now is fixed-length vs variable length instruction encoding.

That adds dependency and makes parallel decoding a sequential problem. That one difference keeps x86 decode widths small around 4 for most designs, and 6 for high end designs. Both Intel and AMD thus use an “OpCache” which stores the “pre-decoded” instructions and that’s what actually feeds the wide backends. (74%-99% hit rate depending on the program)

ARM designs meanwhile can skip that process fully and feed the core from an 8-wide, or even 10-wide decoder trivially as the instruction boundaries and sizes are known already.

This tradeoff in instruction complexity means ARM code is less dense, which translates into high performance ARM designs using larger L1 instruction caches.

x86 have been sitting on a 32Kb instruction cache for pretty much forever even when needing to share across SMT threads, while Apple gives its cores 192Kb and even 128kb for its small ones for single thread.

[-]

Geddagod@reddit

x86 have been sitting on a 32Kb instruction cache for pretty much forever even when needing to share across SMT threads, while Apple gives its cores 192Kb and even 128kb for its small ones for single thread.

Intel has managed to move to a 64Kb L1i since redwood cove.

[-]

RealThanny@reddit

Both types of processor decode instructions coming into the actual instructions executed on the chip. CISC has more complex instructions to decode into many smaller instructions. RISC still has some complex instructions like that, but not as many.

In essence, CISC has more instruction decoding in the hardware, while RISC has more instruction decoding in software (i.e. at the compile stage). The actual instructions executed by the logic units are very similar in scope and count.

There's a notion that RISC is more power efficient than CISC, but that's completely incorrect. ARM is not more efficient than x86 given the same node and design goals. Apple's chips are more efficient due to persistent node superiority, due to the company paying through the nose to be on the most advanced node available at TSMC for many years now. When you look at ARM processors designed for the data center, they're not really more efficient than competing x86 processors, because they don't have a node advantage, and they're not designed for low-power applications.

[-]

Geddagod@reddit

Apple's chips are more efficient due to persistent node superiority, due to the company paying through the nose to be on the most advanced node available at TSMC for many years now.

They outright out design Intel and AMD for 1T perf/watt as well, even iso node.

When you look at ARM processors designed for the data center, they're not really more efficient than competing x86 processors

A couple things here.

First of all, source?

Second of all, ARM in house skus are plagued by having to adopt older ARM designs due to the gap between design and verification and launch (ARM's own server chips are supposed to improve this), lack of SMT (Nvidia server chips will have this), presumably weaker physical design teams (we saw how much of an impact this makes at Google), and worse uncore/interconnects.

[-]

EloquentPinguin@reddit

Its not really true that under the hood RISC or CISC CPUs operate in CISC-Like way. Modern RISC CPUs fuse instructions such that the ALU maybe does two simple arithmetic ops in one cycle. Or the Load/Store Unit gets an internal Op (often called micro op/uop) that has index, strobe, and offset such that it does the effective rege = regA+index*strobe to load an element from memory even though the RISC code might have two or three instructions for this.

The x86 looks different: while most used instructions are quite RISC-esque and do the same fusing and stuff, some instructions are way to complicated and need to be broken down into multiple uops. But ARM also has plenty of complex instructions that might need to be broken down into uops.

The mich bigger difference when talking about ARM and RISC-V vs x86 is that x86s instructions have usually a lot more side effects, stronger guarantees, and are variable length. While variable length is an issue it is mich less severe and mich less cisc related than the other issues: lots of effects per instruction and strong guarantees.

These two both make design space much harder compared to RISC-V for example as register ports are more contended and guarantees such as memory ordering impose limits on optimization in those areas.

[-]

R-ten-K@reddit

FWIW, most/all high performance architectures (x86, ARM, POWER) break up their instructions into uOps (or nanoOps).

Port pressure tends to be more correlated with issue width than instruction encoding part the fetch engine.

[-]

EloquentPinguin@reddit

What I meant with the port pressure is that many instructions in x86 write additional flags. While the instruction complexity is often not that much different having these flags often results in more writes per instruction.

[-]

R-ten-K@reddit

Not that I am aware.

(Disclosure: I have worked on both x86, ARM cores). In both we tended to arrive to designs with highly clustered registered files, with as little ports per reg as possible.

[-]

EloquentPinguin@reddit

Yes but for x86 an add instruction for example might set some flag registers in addition to the result register this requires for the instruction to be executed additional ports to be allocated for that write (and in a modern core probably have multiple physical registers of each flag) that wouldn't be required for an ARM core or am I missing something?

[-]

R-ten-K@reddit

Beyond the fetch stage, once the instructions are in the Fetch Engine, both x86 and ARM instructions are broken down into µops (or nano-ops).

In practice, instead of one complex instruction writing to a data register and updating condition flags simultaneously, the processor emits separate µops: one to write the result, and another to update the condition register. This significantly reduces port pressure.

Similarly we try to organize the register file into a banked structure, so that a lot of those register reads and writes are distributed horizontally. That trades fewer ports for slight increase in the scheduler.

[-]

Intrepid_Lecture@reddit

CISC was a design philosophy from a few decades ago centered on making due with the limits and constraints of the time.
RISC was also a design philosophy from a few decades ago centered on making due with the limits and constraints of the time.

Modern designs borrow many elements of both.

This argument has been dead for close to 30 years.

[-]

R-ten-K@reddit

The distinction is largely meaningless at this point.

ISA and microarchitecture have been decoupled for over three decades, and the decoder hasn’t been a primary performance limiter for just as long.

More importantly, modern CPUs have largely converged on highly speculative, out-of-order, superscalar designs. That’s where the vast majority of transistors, and the associated power and area budgets, are spent.

By comparison, the decoder is a relatively small piece of the puzzle, so defining an entire architecture around it doesn’t really make sense anymore.

[-]

CHAOSHACKER@reddit

You are in luck. Because both ARM and x86 do this translation of “ISA Instructions” to internal RISC-like instructions. Those are usually called micro-ops. The decoders do that job. They translate the ISAs instructions to what the CPU uses internally. Those internal ops are usually more optimized to that specific microarchitecture than the ISAs ops. x86 CPU have been doing that for 30 years now to get the benefits of instruction reordering and superscalar execution. Its an incredibly optimized process by now. I guess that ARM does it do to very similar reasons, considering ARM (1985) is just a couple years younger than x86 (1978).

[-]

DerpSenpai@reddit

Yes, the only difference between ARM and x86 philosophy wise is bloat and instruction length. ARMv9 is 64 bit ONLY, no support for older ARMv7 programs natively. Then instruction size is fixed so it makes the decoding work easier as it can predict always where instructions start and end.

[-]

CHAOSHACKER@reddit

True, thats why on newer ARM designs the decoders are significantly smaller in size.

[-]

R-ten-K@reddit

Not necessarily.

Take Apple’s latest high-performance cores, their fetch engines are actually larger than comparable AMD/Intel designs. The difference is where the pressure gets placed.

RISCy designs tend to push more work to the very front end, which means higher I-cache bandwidth and more expensive multiported structures. CISCy designs, on the other hand, shift some of that pressure downstream with mechanisms like trace or µop caches in the fetch/decode path.

There’s no free lunch. To sustain similar µop throughput into the execution engine, both approaches end up spending roughly comparable power and area budgets in the front end.

Also, decode hasn’t really been the primary bottleneck since the mid-’90s. Modern CPUs have largely converged on out-of-order, speculative, superscalar architectures, where most of the transistor budget now goes.

[-]

levelstar01@reddit

ARM (1985) is just a couple years younger than x86 (1978).

ARMv1 through ARMv3 bear essentially no resemblance to modern ARM, which started with the ARM7 in 1993.

[-]

VastTension6022@reddit

Arm64 is very different from 1985 arm and was designed in 2011.

[-]

theQuandary@reddit

ARM (1985) is just a couple years younger than x86 (1978).

8086 was based on 8080 (1974) which was based on decisions from 8008 (1972) which was itself somewhat informed by the 4004 (1971). The 4004 was the earliest public ISA on a integrated circuit and x86 inherited all its ignorance.

That period of the 1975-1985 was the explosion in advancements in all of computer engineering and computer science (and the computer science advances were a large part of what informed future hardware design).

[-]

CHAOSHACKER@reddit

The difference is that the 8086/8088 arent ISA compatible with the earlier ones. While they might have been informed by those designs, they clearly changed stuff that made them incompatible after all.

[-]

theQuandary@reddit

Yes, there were incompatibilities (mostly related to memory subsystems IIRC), but the core instructions were almost 100% compatible and Intel even offered conversion tools to auto-update your software from 8080/8085 to 8086.

[-]

nittanyofthings@reddit

CISC is not dead! Look at RISC-V processors. They often are built with application specific instructions to achieve great efficiency. This is the core idea of CISC. RISC actually just means keeping the base ISA as clean as possible, but your extensions can be complex.

[-]

FieldOfFox@reddit

CISC is (was) a paradigm to reduce program size.

From what I've calculated, e.g. Apple and Qualcomm Oryon have in fact gone the other way round... to get their performance, I think that the Arm RISC instructions are re-ordered and grouped into larger single instructions that can execute on a single cycle, instead of their "micro-ops" myth.

[-]

laffer1@reddit

Others have great answers but on the x86 side it died with the pentium pro for Intel.

[-]

Verne3k@reddit

Yes, it's true. It's way more nuanced but ultimately true

[-]

TwilightOmen@reddit

RISC architectures became CISCier over time by introducing additional specialized instructions to speed certain tasks up. CISC architectures became RISCier over time by allowing certain low priority tasks to be emulated by multiple simpler tasks. The "pure" definitions simply no longer exist. As concepts, they are well past their expiry date and have been obsolete for, to be honest, a very long time.

To sum things up, the two "extremes" no longer exist, and both approaches have moved somewhat to a middle ground. Now I hope you do not want us to go in detailed depth about instruction pipelines and micro-operations, and the above is enough.

[-]

schmerg-uk@reddit

Quoting comments I've made elsewhere... it gives the chip designer the freedom to completely redesign the internal private micro-op design for different models and different generations of chips while compilers etc continue to write out "standard" assembler... it's not quite the same as the JVM and a JIT but it's a comparison worth bearing in mind.

x86/x64 is actually a "CISC on RISC" chip these days... each of those instructions is broken down into one or micro-ops (more RISC like) and that's what actually scheduled, re-ordered, executed etc (And a single core has multiple execution units so it can typically dispatch or start up to 3 or 4 micro-ops in a single cycle, including 2 floating point multiply operations)

Even tho an AVX chip might have about 40 addressable registers (including 16 registers of 256bits each) each core actually has more like 200 registers that it uses at the micro-op level to store intermediary results as well as speculative results and it just renames these "hidden registers" to the "physical register" names (zero-ing a register is so common that the chip prefills hidden registers not in use with 0, and when it see an instruction to zero a register, it simply renames a pre-zero-ed register and doesn't even dispatch a micro-op... xor eax,eax takes literally 0 cycles to execute) and this is how hyper threading is implemented. A core interleaves the micro-ops from multiple threads and schedules the instructions and allocated underlying registers accordingly.

And each generation of chips (actually even different models within a generation) can have a completely different set of micro-ops and number of registers as long as they properly "emulate" the CISC operations. We never get to see what those micro ops are but there are ways to determine the size of the register files etc

I give a talk at work to other coders etc about how this stuff works and how they mostly don;t have to worry about it but it is all quite amazing

[-]

loozerr@reddit

I feel like this is looking for simple answers to complex questions. But for me a cpu is just some sand tricked into thinking so I'm not the one to answer.

[-]

AutoModerator@reddit

Hello! It looks like this might be a question or a request for help that violates our rules on /r/hardware. If your post is about a computer build or tech support, please delete this post and resubmit it to /r/buildapc or /r/techsupport. If not please click report on this comment and the moderators will take a look. Thanks!

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.