[Asianometry] Titanium: Intel's Great Successor
Posted by SwegulousRift@reddit | hardware | View on Reddit | 33 comments
rchiwawa@reddit
If the AI companies wanted me to believe the hype, they'd figure out how to compile code (or make a compiler) that made the Itanium and its VLIW instruction set absolutely rip
Tuna-Fish2@reddit
That's not something that can be done. Thanks to caches, memory access latency is unpredictable. Because of that, there is insufficient information available at compile time to match competent OoO scheduling.
SW scheduling was perhaps the right architecture for the '80s, and maybe for the very early '90s. Transistor budgets were super tight, memory latency was not that out of whack with CPU speed (~2-15 cycles), and caching was very limited. In that environment, saving the transistors for more execution resources was maybe the right call. But the first Itanium CPU was released in 2001, when x86 CPUs were clocking above 1.5GHz, main memory latency was ~300 CPU cycles, and every CPU with any pretensions to speed already had a two-tier cache system. At that point, SW scheduling was just dumb. All the transistor savings from the simpler core architecture had to be spent immediately on more cache, because Itanium fundamentally needed higher hit rates than OoO architectures to be competitive with them.
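A minimal C sketch of that scheduling problem (the function and the latency figures are illustrative assumptions, nothing Itanium-specific):

```c
#include <stddef.h>

/* Two independent loads feed two independent chains of ALU work. Whether
 * a[i] or b[j] hits in cache depends on runtime history, so each load can
 * take anywhere from a few cycles (L1 hit) to hundreds (DRAM miss). A
 * static, compile-time schedule has to commit to one fixed ordering of fx
 * and gy; an out-of-order core simply executes whichever chain's operand
 * arrives first. */
long combine(const long *a, const long *b, size_t i, size_t j) {
    long x = a[i];           /* latency unknown at compile time */
    long y = b[j];           /* latency unknown at compile time */
    long fx = x * 3 + 1;     /* depends only on x */
    long gy = y * 5 - 2;     /* depends only on y */
    return fx ^ gy;
}
```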
R-ten-K@reddit
Itanium had massive caches, and the main challenge with software scheduling wasn’t really memory latency. IA64’s instruction bundles weren’t particularly “large” (and it wasn’t strictly VLIW), and many of the classic VLIW compiler issues had largely been addressed by the ’80s. A lot of the FUD around its compilers didn’t fully reflect reality.
In practice, Itanium 2 was more than competitive with the high-end OoO RISC chips of the time (AXP, PA, MIPS...), and it outperformed many in-order designs like SPARC. From a pure performance standpoint, the architecture had clear merit. There was also nothing inherent in IA64 that precluded out-of-order implementations; later designs were even expected to push further in that direction.
Where Itanium struggled was elsewhere: heavy reliance on predication, very large multi-ported register files, and the resulting power inefficiency. And like many high-end architectures of that era, it ran into the realities of design cost and economies of scale.
What Itanium ultimately showed is that there’s no “free lunch” in performance. It ended up using similar or larger transistor budgets to hit its targets. Not unlike today, where “simpler” RISC cores often grow as large as (or larger than) complex x86 designs to compete. The original goal of simplifying superscalar scheduling just shifted the complexity elsewhere, largely negating the intended savings.
Tuna-Fish2@reddit
The main challenge with software scheduling is generally a lack of MLP, which is made worse by higher memory latency.
Note that I am not talking about the size of the instructions here. The problem is that you cannot schedule the ops that depend on loads well without knowing which loads return first.
The massive caches were a bandaid that attempted to paper over this problem. The power use was largely caused by the L1 being made so fast to make scheduling easier. The architecture did not have merit, it sucked. The reason the chips were not complete failures is that this was pretty much the moment when Intel's fabs shined the brightest, and they could fab chips better than anyone else. They used this to give Itanium roughly twice the transistor budget that they gave to their high-end x86 server chips, or four times what the low end got. When you consider Itanium iso-process, it's just bad. Had the RISC chips had Intel manufacturing, they would have smoked Itanium.
cp5184@reddit
Itanic was 2-5 years behind on nodes. It was never going to succeed because its transistors were four times bigger, four times slower, and used four times more power than the competition at best. Intel's 180nm and 130nm processes may not have been disasters, but two to three years later they hadn't been transformed into something magically better by some pumpkin-riding fairy.
R-ten-K@reddit
Again, this is a common misconception about what "software scheduling" meant in the context of a VLIW architecture.
They are not intended to address MLP, but ILP.
Better_Employee_7516@reddit
also VLIW may not have won for PCs but it's not a dead concept.
The company I work at uses our own VLIW architecture for our processors.
R-ten-K@reddit
True. Also most modern compilers still lean heavily on VLIW-derived techniques.
E.g. Loop and vectorization optimizers in LLVM and GCC make extensive use of VLIW ideas (aggressive loop unrolling, software pipelining, register allocation strategies, spill minimization, live-range splitting, etc). Control-flow techniques like profile guided optimization (PGO), block reordering, and hot/cold code splitting are directly descended from VLIW concepts like trace scheduling and hyperblock formation.
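As a rough illustration of one of those techniques, here is software pipelining written out by hand in C; it is a sketch of the kind of transformation those loop optimizers perform, not actual compiler output:

```c
#include <stddef.h>

/* Software pipelining by hand: the load for iteration i+1 is issued while
 * iteration i's multiply and store are still outstanding, overlapping
 * memory access with ALU work across loop iterations. VLIW compilers did
 * this to fill issue slots; modern loop optimizers apply the same idea. */
void scale(float *dst, const float *src, size_t n, float k) {
    if (n == 0) return;
    float cur = src[0];               /* prologue: first load            */
    for (size_t i = 0; i + 1 < n; i++) {
        float next = src[i + 1];      /* next iteration's load, early    */
        dst[i] = cur * k;             /* this iteration's compute/store  */
        cur = next;
    }
    dst[n - 1] = cur * k;             /* epilogue: final compute/store   */
}
```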
Tuna-Fish2@reddit
What? Yes, they were intended to extract ILP from ALU-heavy operations. The problem is that this was solving the wrong problem, and concentrating on ILP in very straight-line code doomed the architecture to being horrible. Getting more ALU ops scheduled at the same time is uninteresting. The job of a good architecture is to get all loads started as early as possible, and to start working on any of the ALU ops that depend on those loads as early as possible.
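A rough C sketch of that priority, with made-up work; the point is only that the loads are issued before the independent ALU work, not packed behind it:

```c
#include <stddef.h>

/* The schedule a good machine (or compiler) should aim for, written as
 * straight-line C: issue every load as early as possible so a potential
 * cache miss overlaps with the independent ALU work, and consume the
 * loaded values only where the dependence forces it. Packing more ALU ops
 * per cycle does nothing if they all end up stalled behind a late load. */
long score(const long *a, const long *b, long w0, long w1) {
    long x = a[0];               /* loads issued first                    */
    long y = b[0];
    long t = w0 * w1 + 7;        /* independent work hides load latency   */
    return (x + y) * t;          /* dependent work as late as it must be  */
}
```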
R-ten-K@reddit
The VLIW aspect of Itanium was primarily about exposing and exploiting ILP.
MLP, on the other hand, was handled through the rest of the memory subsystem: caches, prefetching, predictors, load/store buffers, etc. IA64 could sustain a large number of in-flight memory operations, overlap misses, and track dependencies between loads and stores.
So while ILP was largely a compile time concern, MLP was still very much supported by dedicated hardware structures.
cp5184@reddit
Also, Intel all but abandoned Itanium almost immediately. Initially it looks like it was a year to two and a half years late to 180nm, and it stayed on that node for two generations until H2 2003. Third-gen Itanium was about two years late to 130nm and stayed on it for five generations until H2 2006... It then stayed on 90nm for almost half a decade and moved to 65nm in 2010, by which point Itanium was half a decade behind anything else Intel was making... Three years later it got bumped to 32nm, with a second 32nm generation before end of line...
Just by the nodes it was on, it was never positioned to be a successful product. It was always two to five years behind Xeon, and it only got worse.
It may have started with good intentions of catching up with Xeon, but as soon as Intel realized that it was basically only being bought by HP, and with AMD's introduction of AMD64, Intel lost any and all interest in Itanium and relegated it to a forever-legacy product, stringing HP along.
But it's not like HP's CEO wanted to turn HP into the Itanic company either. HP's Itanic line was just that, a forever-legacy product. Even HP itself had no ambition for Itanic to be a leading-edge product in a Wintel world.
rchiwawa@reddit
Truly, appreciate the insight.
symmetry81@reddit
The Mill Computing folks have (had?) a clever plan to re-specialize binaries at link time from a general format to the latencies and width of the particular chip a binary was running on, which can apparently be done in O(n) time.
Their website seems to be having trouble but if you've got time the YouTube lecture series was really interesting. They sadly seem to be a victim of trying to innovate in too many directions at once, including corporate structure of all things.
Tuna-Fish2@reddit
No, they were the victim of their core idea being unworkable.
The ability to respecialize for different chips doesn't fucking help, when the problem is not that chips are different, but the fact that memory access is unpredictable.
The Mill could have been a nice idea in 1985, when memory access was effectively constant at ~6 cycles or something. Today, none of the ideas they proposed help with the problem that a memory access can take anywhere between 4 and 400 cycles, based on many things, some of which are unavailable to the compiler.
If your architecture cannot set two different load ops in motion, and then execute code to work on whichever data arrives first, you have chosen to build a slow CPU. No exceptions.
pi314156@reddit
Hexagon, which is an oddball between a CPU (it has a software-filled TLB but can still run full OSes) and an accelerator, survives as a VLIW to this day, but with an important twist.
Data is DMA'd in via a user-mode-accessible DMA engine that uses virtual addressing to a TCM (VTCM in Qualcomm parlance; it's an 8MiB+ chunk of memory private to a Hexagon core) - its NVIDIA equivalent on GPUs would be the TMA.
And then data load/stores from compute kernels are from/to that VTCM which has predictable latency.
The 1st 4GiB of virtual memory is still accessible via conventional load store with the perf catches that implies, but memory mapped above that is only accessible through the DMA engine.
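A generic C sketch of that DMA-plus-scratchpad pattern; dma_start/dma_wait, TILE, and process_tile are hypothetical placeholders, not Qualcomm's actual Hexagon SDK API:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE 4096  /* elements per tile; VTCM is only a few MiB, so tiles stay small */

/* Stand-ins so the sketch compiles; real code would program the user-mode
 * DMA engine and park on its completion status instead. */
static void dma_start(int16_t *dst, const int16_t *src, size_t n) {
    memcpy(dst, src, n * sizeof(int16_t));
}
static void dma_wait(void) { /* wait for the outstanding transfer */ }

static void process_tile(int16_t *tile, size_t n) {
    for (size_t i = 0; i < n; i++) tile[i] = (int16_t)(tile[i] >> 1);  /* placeholder kernel */
}

/* Double buffering: while tile i is processed out of the scratchpad
 * (predictable load/store latency), the DMA engine is already filling the
 * other buffer with tile i+1 from DDR. */
void run(const int16_t *ddr_src, size_t tiles, int16_t *vtcm /* holds 2*TILE */) {
    int16_t *buf[2] = { vtcm, vtcm + TILE };
    if (tiles == 0) return;
    dma_start(buf[0], ddr_src, TILE);
    for (size_t i = 0; i < tiles; i++) {
        dma_wait();                                      /* tile i is resident     */
        if (i + 1 < tiles)                               /* prefetch tile i+1      */
            dma_start(buf[(i + 1) & 1], ddr_src + (i + 1) * TILE, TILE);
        process_tile(buf[i & 1], TILE);                  /* compute hits only VTCM */
    }
}
```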
R-ten-K@reddit
FWIW, Hexagon isn't that much of an oddball; it's a fairly standard DSP block (now NPU).
VLIW, in general, has done quite well in highly regular, streaming workloads, which is why you see it in GPUs, DSPs, and similar accelerators.
pi314156@reddit
At least on the GPU side of things, it's been a really long time since I saw VLIW GPUs. The last one I think was the GeForce ULP in Tegra 4, after which NVIDIA gave up on it - and AMD did so with the move to GCN.
R-ten-K@reddit
Yeah, modern GPUs have largely moved away from “pure” VLIW towards more hybrid approaches. (FWIW, “pure” VLIWs were very rare to begin with, since most archs that we label as such aren't particularly long/wide.)
A modern CUDA core has dynamic scheduling hardware, though not nearly as aggressive as what you'd find in a high-end scalar OoO CPU. To compensate, the compiler does more work (things like instruction grouping and scheduling), so you end up with a mix of TLP + ILP + software scheduling/bundling/pipelining.
It’s not VLIW in the strict sense, but it’s also not purely hardware driven scheduling either, more of a pragmatic middle ground.
You see similar trends in AI accelerators (esp NPUs), where VLIW-like approaches still make sense for highly regular workloads.
At this point, “pure” models like VLIW, SIMD, etc in isolation are rare. Most modern architectures are hybrids, with HW and SW each handling the parts they’re best suited for.
R-ten-K@reddit
The compiler worked just fine.
What killed itanium was cost (and power consumption), not lack of performance.
Tuna-Fish2@reddit
Firstly, the performance wasn't actually good on complex loads. See for example 4-socket TPC-C scores, >200k for Opteron (HP ProLiant DL585), 161k for Itanium (HP Integrity rx4640 (Madison 9M)).
Secondly, the only way that performance was achieved was huge, extremely low-latency caches. These are what led to the high cost and power consumption. If you added those caches to contemporary x86 chips, they would have been much faster still. It was the architecture that sucked; it was kept on life support by heroic efforts on the manufacturing side. The die size of Madison 9M was 432mm², compared to contemporary Xeons at 131-269mm² and Opteron at 193mm².
R-ten-K@reddit
IA64 was competitive with the top-end architectures of its time at launch: on SPEC it was right up there with AXP and PA, which were the top OoO performers at that time.
Where it really lost ground was later. It got thoroughly outclassed by AMD Opteron and newer non-netburst Xeons, which also ended up overtaking the rest of the high-end RISC field. Those designs had much more aggressive out-of-order execution and speculation, benefited from a process node advantage (90 nm vs 130 nm), and critically had economies of scale on their side.
So what killed Itanium was what killed the rest of the high-end RISC machines. The performance was there, but the performance/cost ratio wasn't.
Tuna-Fish2@reddit
It was carried by its great performance on the very regular HPC-type loads where it could really shine, and which were way too high a proportion of SPEC at the time. Itanium scoring so high on SPEC while doing so badly on actual workloads is what eventually led to many of the old simplistic microbenchmarks being phased out of the suite. Don't look at SPEC, look at actually hard tasks, like database benchmarks or code compilation. And, er, on those Merced does not do well.
R-ten-K@reddit
FP-heavy workloads were the dominant target for that generation of high-end CPUs. Most RISC competitors prioritized FP as well, so it made sense for Intel/HP to optimize IA64 around those use cases. After all, Itanium was supposed to be the successor to PA within HP.
Ironically, x86 microarchitectures were relatively weak in FP at the time, but that ended up being an advantage as the industry shifted toward commodity clusters running internet workloads, which were far more int-heavy. In that sense, Intel/HP effectively missed the rise of web-scale infrastructure, because IA64 was conceived in the early ’90s when “high-end” still meant technical/workstation and HPC workloads.
Intel's x86 line succeeded there somewhat by accident; it turned out to be the right tool at the right time, even if that wasn't the original target. (Eventually their luck ran out, and they missed mobile and AI.)
Also worth noting: Itanium wasn’t a primary reason why we transitioned to SPEC06.
Tuna-Fish2@reddit
At the time, TPC-C was more economically significant than all of HPC put together. HP and Intel spent more time trying to optimize databases on Itanium than they spent on HPC work. You are looking at the past with rose-colored glasses, Itanium was known to be bad on the key loads it was intended for when it first released.
R-ten-K@reddit
No. I’m just providing an objective, microarchitectural perspective: where these ideas and designs came from and why they existed.
I don’t have any attachment to a particular architecture, nor am I making emotional arguments for or against.
waitmarks@reddit
They can barely make a regular C compiler work, so don't get your hopes up.
jenny_905@reddit
I remember when SGI ditched MIPS and went full Itanium. What a disaster.
R-ten-K@reddit
MIPS had been dead for a while by the time SGI made the switch.
6950@reddit
Itanium that became a Titanic
BigPurpleBlob@reddit
It's not spelled Titanium.
It's spelled Itanic ;-)
TwoCylToilet@reddit
It's not spelled Itanic. That's the ship.
It's Iluminium ;-)
mystandardusername@reddit
Not if you're in the US. Then it'd be Illuminum.
SwegulousRift@reddit (OP)
Minor spelling mistake wins again