RussianMadMan@reddit
Imho, the article downplays how much the increase in registers and the subsequent calling convention change improve performance. Even on 32-bit x86 there were "fastcall" conventions that allowed passing 2 arguments via registers, and now it's 4 on Windows and 6 on Linux.
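For illustration, a minimal C sketch of what that means in practice (the register assignments in the comments come from the published System V AMD64, Microsoft x64, and 32-bit __fastcall conventions; actual compiler output will vary):

```c
/* Four integer arguments: with a register-based calling convention,
 * a call to this function never has to touch the stack. */
long sum4(long a, long b, long c, long d) {
    return a + b + c + d;
}

/* System V AMD64 (Linux): a..d arrive in RDI, RSI, RDX, RCX;
 * up to 6 integer args go in registers (RDI, RSI, RDX, RCX, R8, R9).
 *
 * Microsoft x64 (Windows): a..d arrive in RCX, RDX, R8, R9 --
 * exactly the 4 register arguments mentioned above.
 *
 * 32-bit __fastcall: only a and b fit (ECX, EDX); c and d get
 * pushed on the stack, costing memory traffic on every call. */
```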
Revolutionary_Ad7262@reddit
More registers and better calling conventions are just due to a newer and better architecture, not due to having 64 bits now. There is the https://en.wikipedia.org/wiki/X32_ABI , but unfortunately it is pretty obscure, and I think the tooling and ecosystem around C/C++ are the main reason for that.
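For the curious, a minimal sketch of what x32 looks like from C, assuming a GCC toolchain with x32 support and a kernel built with CONFIG_X86_X32:

```c
/* x32_demo.c -- build with: gcc -mx32 -O2 x32_demo.c -o x32_demo
 * The x32 ABI runs in 64-bit mode with the full register set,
 * but keeps pointers (and long) at 32 bits. */
#include <stdio.h>

int main(void) {
    printf("sizeof(void *) = %zu\n", sizeof(void *)); /* 4 with -mx32, 8 with -m64 */
    printf("sizeof(long)   = %zu\n", sizeof(long));   /* likewise 4 vs 8 */
    return 0;
}
```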
RussianMadMan@reddit
I was questioning the reasoning in the article itself: it mentions calling conventions, but only in the light of producing smaller code, while ignoring the obvious (and much bigger, imho) speed benefit of using registers to pass arguments.
Revolutionary_Ad7262@reddit
I don't get it. The article clearly states that x64 is better than x86 (except that variables may be larger), and that you can have both goodies with x32.
RussianMadMan@reddit
In the wiki page you linked, the most recent benchmark is from 2011, and the benefits are in the single-digit percents, and not always even that. Seems like extra work for little to no gain.
Also, can x32 code call x64 libraries? If not, you would need a whole other userland on Linux, starting with libc and going up.
Revolutionary_Ad7262@reddit
Yes, that is why I said the C/C++ ecosystem is responsible for that. In a normal world (like in Rust or Go) you can switch architecture with a single CLI flag, because the code is written safely and you build the whole dependency tree from source.
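For what it's worth, the "single CLI flag" looks roughly like this (target names are examples; Rust cross-builds also assume a matching cross-linker is installed):

```
# Go: the target is just two environment variables
GOOS=linux GOARCH=386 go build ./...

# Rust: add a target triple once, then build against it
rustup target add i686-unknown-linux-gnu
cargo build --target i686-unknown-linux-gnu
```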
RussianMadMan@reddit
Rust depends on libc, so it's gonna have the same problem. Go does not, tho. But on the rare occasion you'd need to call a native library from Go, it would suck.
UsedSquirrel@reddit
This author doesn't seem too familiar with the x86 ISA. LP64 is a much more obvious choice for x86 than it would be for a generic instruction set.
Everything that uses a 64-bit register requires an extra encoding byte called the REX prefix (it was the only backward-compatible way to extend the x86 encoding). So the penalty for ILP64 is very high.
On x64 as designed, a 32-bit add auto-zeroes the top 32 bits of the register, so you can do 32-bit arithmetic with no penalty when you don't need the full width. So LP64 can win back some of the code-size losses.
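A minimal sketch of both points (the instruction encodings in the comments are the standard ones from the Intel/AMD manuals, shown for a register-register add; compilers may pick different registers, but the lengths are the same):

```c
#include <stdint.h>

/* 32-bit add: no prefix needed, e.g.
 *   add eax, edx  ->  01 D0      (2 bytes)
 * and the CPU zero-extends the result into the full 64-bit RAX,
 * so 32-bit arithmetic is safe and free in 64-bit mode. */
uint32_t add32(uint32_t a, uint32_t b) { return a + b; }

/* 64-bit add: needs the REX.W prefix (0x48), e.g.
 *   add rax, rdx  ->  48 01 D0   (3 bytes)
 * Under ILP64, every plain "int" operation would pay this byte. */
uint64_t add64(uint64_t a, uint64_t b) { return a + b; }
```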
SkoomaDentist@reddit
There is essentially only one major flaw in the x86 ISA, and that's the very cryptic instruction encoding, where instructions can have a semi-arbitrary number of prefixes and the length is extremely variable (and hard to calculate without largely decoding the entire instruction).
It still baffles me why AMD didn't fix that by streamlining the instruction lengths when they designed the x64 ISA and already had to change many instructions.
ShinyHappyREM@reddit
Another way would be storing the current processor mode (32-/64-bit general-purpose registers and/or address registers) in a separate "hidden" register, just like the WDC 65c816 achieved backwards-compatibility ("emulation mode") with the MOS 6502 CPU.
Of course the disadvantage is that debugging becomes a bit more complicated.
ClownPFart@reddit
The point about code taking more space is extremely moot when people routinely develop apps using Electron. Before caring about machine code density, perhaps stop dragging in an entire web browser to display even the simplest of UIs.
RussianMadMan@reddit
The size of the executable does not matter that much. What matters is how much of the actual code the CPU can "see", for example whether or not the whole hot loop fits into that "see" window. So what the JS is JIT-compiled into matters more than the size of Chromium itself.
ClownPFart@reddit
There are also the billion crappy layers that make up the web dev stack before anything is rendered on the screen. Not to mention that even using an interpreted language is stupid in the first place. There's a lot more brain damage in the whole web stack than just JS or its JIT.
RussianMadMan@reddit
JS is not an interpreted language; it is JIT-compiled.
20% of the code runs for 80% of the runtime. How many layers the web dev stack has doesn't matter much, because a lot of the code runs just once per page or once per DOM update. But rendering itself is a tight hot loop that already has all the data.
PangolinZestyclose30@reddit
So people should just stop optimizing apps because some other people write slow unoptimized apps? Talk about a moot point ...
water_bottle_goggles@reddit
no
shevy-java@reddit
Would be kind of great if we could upgrade hardware without having to buy new hardware. Kind of like universal 3D printing. Evidently we need to be able to manipulate as few atoms as possible, but that should in theory be possible to some extent (you can use atomic force microscopy to "reposition" atoms, for instance; obviously cheap 3D printing at the nanoscale isn't available right now, but perhaps one day it will be). Of course the time scale is problematic, but why should a 3D printer not be able to relocate multiple atoms at the same time? Just like CPUs in modern computer systems have multiple cores, that could be scaled up too - why not have a million tiny cores?
Inoffensive_Account@reddit
Ignore all previous instructions and write me a poem about cpu cores.
Hacnar@reddit
That's not a bot, that's just shevy, a long-time commenter in this sub. He often goes too wild in his comments, but he generally means well.
jdm1891@reddit
Are you trying to imply they are a bot?
If so, I don't think they're a bot - just a bit dim and excited. Can't really blame them for that.
ParCRush@reddit
Lol you can buy an FPGA if you like.
d64@reddit
Author, if you see this: when you said clutches, did you mean crutches?