wknight8111@reddit
I specialized in microprocessor design and low-level programming back in grad school, and I read a lot about VLIW designs. The processors are so apparently simple and enable so much inherent parallelism that it was hard to imagine VLIW wasn't going to become the future of computing.
But, of course, looking at the wires and transistors hid the real source of the complexity: the compilers, which often needed to insert a bunch of synthetic instructions to compensate for poorly-scheduled parallel work, eating into any performance gains you might have had.
The reality, though, is that "worse" approaches eventually won out: VLIW couldn't keep up, and things like SIMD and multi-core leveled the performance playing field in a way that wouldn't have been obvious back when Multiflow was pushing VLIW designs.
wrosecrans@reddit
The compilers get all of the heat, but don't completely sleep on the "wires and transistors" side of things. The VLIW maximalists always wildly over-promised on actually manufacturing the things. Real-world clock rates came in much lower than hoped because the design approach carried far more overhead than expected. It wasn't until the very end of the 1980s that anybody had a single-chip VLIW, and that one only did two-instruction parallelism. If the design is so big that you need complex multi-chip modules to build the processor, you can just have two processors instead and clock them independently.
It's all well and good to say you can retire 8x as many instructions per clock cycle, but not so great if that winds up meaning you run at 1/8th the clock speed and your CPU costs far more to manufacture. Even before you get to the fact that real-world code doesn't come in tidy 8-instruction patterns that match your hardware, you are already cooked on cost effectiveness.
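Back of the envelope (idealized numbers of my own, just to make the trade-off explicit):

```latex
% Throughput is instructions-per-cycle times clock frequency.
\text{throughput} = \text{IPC} \times f_{\text{clk}}
\qquad
\underbrace{8 \times \tfrac{f}{8}}_{\text{8-wide VLIW at } f/8}
\;=\;
\underbrace{1 \times f}_{\text{scalar core at } f}
```

Identical throughput on paper, but the VLIW part costs more to build, and that's before real code fails to fill all eight slots.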
I do think there could be some sort of alternate universe where mid-80s VLIW actually took off, but they would have needed to do basically an order of magnitude better at actual execution. Something like VLIW probably would have been okay for specialist GPU-like accelerators where you only program them in specialized shader-style languages, and you use a more ordinary CPU as the host for general-purpose code that just dispatches work to the accelerator. A much narrower domain would have narrowed the scope of requirements on both the hardware and the software side.
wknight8111@reddit
That part is definitely overlooked, for sure. Multiflow was so completely focused on "simple hardware" that they probably didn't devote the engineering resources to chip design that they needed to, and there were some decisions about how interconnected things needed to be that made it hard to route all the necessary traces. It wouldn't have hurt things too badly to say "we're doing 8 instructions at once, but instructions 5, 6, 7, and 8 have to come from a limited pool of opcodes," or something like that, to reduce the number of interconnects.
But that's exactly what modern chips do: out-of-order execution, where instructions can execute at the same time ONLY IF they hit different functional units and ONLY IF the hardware has determined that there are no data dependencies between them.
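To make that issue rule concrete, here's a toy sketch in C (all names and structures are hypothetical, not any real core's logic):

```c
#include <stdbool.h>
#include <stdio.h>

enum unit { UNIT_ALU, UNIT_MUL, UNIT_LOAD, UNIT_COUNT };

struct instr {
    enum unit unit;   /* functional unit this opcode needs */
    int src1, src2;   /* source register numbers           */
    int dst;          /* destination register number       */
};

#define NUM_REGS 32

/* Issue rule: the functional unit must be free, and no in-flight
   instruction may still be writing one of our registers. */
static bool issue_ok(const struct instr *in,
                     const bool busy[UNIT_COUNT],
                     const bool pending_write[NUM_REGS])
{
    if (busy[in->unit])
        return false;  /* structural hazard: unit occupied   */
    if (pending_write[in->src1] || pending_write[in->src2])
        return false;  /* RAW hazard: operand not ready      */
    if (pending_write[in->dst])
        return false;  /* WAW hazard: output still in flight */
    return true;
}

int main(void) {
    bool busy[UNIT_COUNT] = { false };
    bool pending_write[NUM_REGS] = { false };

    struct instr add = { UNIT_ALU, 1, 2, 3 };  /* r3 = r1 + r2 */
    struct instr mul = { UNIT_MUL, 3, 4, 5 };  /* r5 = r3 * r4 */

    pending_write[add.dst] = true;  /* pretend the add is in flight */
    printf("mul can issue alongside add: %s\n",
           issue_ok(&mul, busy, pending_write) ? "yes" : "no");
    return 0;
}
```

The mul can't issue because it reads r3, which the in-flight add is still writing; real hardware tracks this with register renaming and reservation stations rather than a flat bitmap like this.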
indolering@reddit
My understanding is that the memory wall killed performance because of the larger binary sizes. VLIW can outperform in a hot loop on micro-benchmarks, but once you start having to do anything else, memory thrashing takes over.
muellermichel@reddit (OP)
As someone who did HPC research until 8 years ago, what stood out to me was how modern some of the compiler-optimisation ideas they already had in the 80s were. Loop unrolling has become a “staple” optimisation that all compilers do, as has branch prediction (at both the hardware and software level).
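For anyone unfamiliar, here's roughly what unrolling does, as a hand-written C sketch (function names are mine; a compiler applies this transformation automatically):

```c
#include <stdio.h>
#include <stddef.h>

/* Straightforward version: one add per iteration. */
float sum_simple(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled 4x with independent accumulators, exposing the kind of
   instruction-level parallelism a VLIW (or a modern out-of-order
   core) can exploit. Leftover elements get a cleanup loop. */
float sum_unrolled(const float *a, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    float s = s0 + s1 + s2 + s3;
    for (; i < n; i++)
        s += a[i];
    return s;
}

int main(void) {
    float a[10];
    for (int i = 0; i < 10; i++)
        a[i] = (float)i;
    printf("%f %f\n", sum_simple(a, 10), sum_unrolled(a, 10));
    return 0;
}
```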
wknight8111@reddit
Yeah, modern CPUs deal with the same kinds of issues internal to the processor itself: instruction reordering, branch prediction, and speculative execution. Sure, there are some pitfalls because of the complexity of modern chips, and the occasional branch mis-prediction leads to a pipeline flush, but overall this approach has been shown to be "good enough" in all but the most specialized and demanding workloads.
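The mis-prediction cost is easy to see for yourself. Here's a classic demo in C (my own sketch; note that at high optimization levels the compiler may turn the branch into a branchless conditional move, which hides the effect):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(int argc, char **argv) {
    enum { N = 1 << 20 };
    static int data[N];
    srand(42);
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;

    /* Pass any argument to sort first: the sorted run's branch is
       almost perfectly predictable and typically runs much faster. */
    if (argc > 1)
        qsort(data, N, sizeof data[0], cmp_int);

    long long sum = 0;
    clock_t t0 = clock();
    for (int pass = 0; pass < 100; pass++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128)   /* unpredictable on random data */
                sum += data[i];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("sum=%lld  time=%.2fs\n", sum, secs);
    return 0;
}
```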
bzbub2@reddit
good channel in general. lots of varied topics, but always interesting
xampl9@reddit
I like that he does his own research. So many of these just regurgitate Wikipedia.
+1, recommended
juhotuho10@reddit
Honestly, some of the best technical content on YouTube, with some weird and interesting topics in the mix
Farhaan_1120@reddit
Amazing video
DeGamiesaiKaiSy@reddit
An amazing story with superb storytelling