BlueGoliath@reddit
TL;DW: The N64 is extremely memory-bandwidth starved, so undoing optimizations that trade bandwidth for fewer CPU cycles tends to net incremental performance boosts.
CrazyHardFit@reddit
This article makes no sense. Literally, you can't have an optimization that makes a game slower. This is, literally, the opposite of an optimization. A code change that slows the game down can be called many things, but 'optimization' is not one of them.
neuralbeans@reddit
That's still an optimisation though. Just optimising data transfer instead of CPU.
levodelellis@reddit (OP)
I would call that going towards baseline so I don't think it should be considered an optimization
ProbsNotManBearPig@reddit
Optimization requires defining quantitative criteria for evaluation. It’s not subjective. Anything that improves fps is an optimization, and you can’t just move the goalposts because the way to get there is simplicity. Formally, what is described in the OP is an optimization.
levodelellis@reddit (OP)
If you said this to me in a job setting I'd fire you
Optimizations can and often do slow other parts of the code down
Brayneeah@reddit
What you describe is actually not an uncommon optimisation that compilers make! (if they can verify that doing so won't change a program's results)
levodelellis@reddit (OP)
Compilers don't do that. Unless you ignored the thread and think I'm talking about dead code optimization like that other guy. Don't make up stuff
ehaliewicz@reddit
I just wrote a simple compiler that rolls up loops like this just for fun 🤷
levodelellis@reddit (OP)
What compiler/language is that? If you like optimizations maybe you're interested in looking at SLP. I think that's what replaced reroll https://gcc.gnu.org/projects/tree-ssa/vectorization.html#slp
levodelellis@reddit (OP)
I imagine all the implementations would unroll again? It seems like reroll was removed https://github.com/llvm/llvm-project/pull/80972
ehaliewicz@reddit
If it was done specifically to reduce code size, I don't see why they would.
levodelellis@reddit (OP)
Maybe, but I heard it was to reroll for BLAS
ehaliewicz@reddit
https://reviews.llvm.org/D1940 Yes, specifically to reduce code size, and sometimes to allow the compiler to choose how much unrolling is appropriate for the hardware, including zero times.
levodelellis@reddit (OP)
There's also SLP but IIRC that's more like instruction combining than rolling up the loop
ehaliewicz@reddit
Any semantics preserving transformation that improves performance can be called an optimization and disagreeing seems pretty silly.
ehaliewicz@reddit
Why does it matter what you call it? Are you an authority on the definition or something?
LookIPickedAUsername@reddit
I don’t care what you call it, but I do care what compiler experts call it. And they call it an optimization. It’s called “dead code removal” and is a standard optimization pass in all modern compilers.
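For anyone unfamiliar with the pass being named here, a minimal illustration of what compilers mean by dead code elimination (the snippet is illustrative, not from the video or either commenter):

```c
int f(int x) {
    int unused = x * 37;   /* result never read: the multiply is removed  */
    if (0) {               /* branch can never be taken: the body is gone */
        x += 100;
    }
    return x + 1;          /* only this line survives dead code removal   */
}
```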
levodelellis@reddit (OP)
I am a compiler expert... and that's not dead code
LookIPickedAUsername@reddit
I was just responding to what you said, which was that you’d never call removing code an optimization. Maybe you meant “functional code”, but that’s not what you said.
levodelellis@reddit (OP)
So do you in fact care about what I call it and what I say?
I clearly said deleting code and I said "what the video talks about"
levodelellis@reddit (OP)
I know this is a big ask but reddit, quit being stupid. Check my submission history.
Jarpunter@reddit
Removing functional code is literally not dead code removal
LookIPickedAUsername@reddit
Where did I suggest otherwise?
Jarpunter@reddit
I mean that’s the entire premise of this conversation unless you just did that redditor thing where you take one statement in a total vacuum without any consideration for its context in order to make an “erm acktually” comment that nobody appreciates.
castthisaway5839@reddit
Obviously anything that hurts performance isn't an "optimization" by definition. And anything that achieves the same original goal while improving performance is.
So clearly when Kaze's saying "optimizations made Mario 64 SLOWER", it's implied shorthand for "code added in an *attempt* to optimize, and would have been optimizations under other circumstances, actually hurt performance."
Everyone knows by definition a pedantic, literal interpretation doesn't make sense. And everyone can understand what Kaze is communicating, and the tradeoff is that it works well as a terse, eye-catching title (without being misleading).
So it's kind of dumb that someone ITT is "well actually"-ing the obvious. OP also isn't phrasing things very well, but what he's spiritually trying to say is: "saying you're 'optimizing' by just `git revert`ing the failed experiment someone put in is like saying you're 'cooking' by picking out the olives someone put in your pasta."
Like... yes, you could certainly say that, but the point is getting lost in the pedantry. The main point was that it's super interesting that the Mario 64 code contains a number of seemingly tacked-on, failed experiments that can effectively be naively `git revert`ed back to a better-performing baseline.
This work by Kaze is very qualitatively different than the other efforts he's put into optimization in that it almost feels like "anti-work", and "well actually"-ing a point I think literally everyone in this thread actually already understands is just noise, mostly.
KingJeff314@reddit
"Attempted optimizations" if you want to be pedantic. The point is, they had functional code, then tried to rewrite it to squeeze performance out, but it was counterproductive to their goal
mr_birkenblatt@reddit
Yeah, but the developers put the CPU-focused optimizations in themselves. This video removes that extra code, which results in a speedup
Mynameismikek@reddit
A big part of me questions whether this was done at the end of dev, or way too early. Some of these felt to me like SNES-era "we just do it that way" optimisations rather than something that came from actually having hardware in front of you.
ShinyHappyREM@reddit
IIRC back in the day it was reported in some magazines that the game developers had to simulate the console on their SGI workstations before they had access to real ones.
Mynameismikek@reddit
Yeah - the early dev hardware was a massive SGI Onyx. You'd see them trotted out for marketing gigs - "you'll get this $100k workstation in a games console!" Later kits ran on the much smaller Indy or even a PC.
That's kinda why I think this was done early on. The Onyx was a bandwidth beast, but they also knew the CPU and graphics hardware was much more capable than what would end up in the final console, so they optimised for what they knew would get reduced.
mortaneous@reddit
Man, I miss SGI, some of their hardware was damaged sexy back in the day.
Murky-Relation481@reddit
You could say that damage made them do some RISC-y things.
Jamie_1318@reddit
The game was decompiled, the original source isn't available. There isn't a way to tell which the developers wrote, and which the optimizer added.
mr_birkenblatt@reddit
The devs compiled in debug mode. The compiler didn't add any optimizations
double-you@reddit
Is somebody using "optimization" wrong somewhere, or what claim are you attempting to correct here?
neuralbeans@reddit
I guess the issue is that you can't say that the Mario64 developers were doing optimisations if the system became slower.
falconfetus8@reddit
They thought they were making it faster, at the very least. It's an attempted optimization.
JaggedMetalOs@reddit
I mean it's actually optimizing for the hardware vs the standard programming optimizations you learn in CS.
sammymammy2@reddit
No. The data transfer is so slow that the CPU is stalled while waiting for instructions.
levodelellis@reddit (OP)
Outline (the opposite of inline) blew my mind
player2@reddit
Still a very relevant (de-)optimization today. If you have a loop with a condition that is not usually taken, outlining the not-taken branch might help the hot path fit into a single cache line. If the branch predictor can correctly predict the cold path isn’t taken, it won’t prefetch those instructions and your loop will execute entirely out of L1 instruction cache.
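A hedged sketch of that pattern in C (names are hypothetical; `noinline`/`cold` are GCC/Clang extensions, and whether it actually helps depends on the real code layout):

```c
#include <stdio.h>
#include <stdlib.h>

/* Cold error path outlined into its own function, kept out of the loop body. */
__attribute__((noinline, cold))
static void handle_overflow(int i)
{
    fprintf(stderr, "overflow at element %d\n", i);
    abort();
}

void accumulate(const int *v, int n, long *sum)
{
    for (int i = 0; i < n; i++) {
        if (v[i] > 1000000)       /* rarely taken */
            handle_overflow(i);   /* hot loop stays small and cache-resident */
        *sum += v[i];
    }
}
```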
kesawulf@reddit
You're a compiler expert and outlining blew your mind?
BlueGoliath@reddit
I'm not entirely sure how these save bandwidth, especially with something like loop rolling.
gingingingingy@reddit
It's more like the code takes up less space, so less bandwidth has to be used moving it into cache
BlueGoliath@reddit
What "code"? It's instructions.
artofthenunchaku@reddit
Inlining code leads to more instructions in the binary overall, while improving performance by reducing the instructions for an individual function call (there's more to it, but this is the relevant part). It's a tradeoff between CPU performance and memory usage.
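Roughly what that tradeoff looks like at the source level (a toy example, not from the game):

```c
/* If clampf() gets inlined, every call site grows by a copy of its body:
 * more instructions overall, but no call/return overhead per element.
 * If it stays an out-of-line function, each use is just a short call. */
static inline float clampf(float x, float lo, float hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

void clamp_all(float *v, int n)
{
    for (int i = 0; i < n; i++)
        v[i] = clampf(v[i], 0.0f, 1.0f);   /* body likely duplicated here */
}
```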
BlueGoliath@reddit
I'm aware but people are referring to two different things as if they were the same. They aren't.
artofthenunchaku@reddit
Are we really arguing the semantics of "code" vs "instructions"?
Good Lord
BlueGoliath@reddit
It isn't semantics. The user-facing code model and the actual instructions are likely to be different, especially when optimizations come into play.
glacialthinker@reddit
Your overly picky distinction was confusing to me, leading me to follow this subthread to dispel my confusion... because I grew up with code being various kinds of assembler mnemonics, which were 1:1 mappings to instructions. That is, I had no problem understanding what they meant by use of the word "code", even though for you it might imply a higher level language.
BlueGoliath@reddit
At no point was assembler mnemonics shown in the video. It was all clearly C code. Watch the video or get your eyes checked.
stylist-trend@reddit
You are absolutely insufferable.
Instead of just mentally substituting "code" with "instructions" and moving on with your life, you have to put a ton of effort into this giant tirade to prove your... intellectual superiority?
Or is it that you had nothing to add about more instructions == more bandwidth, and you just needed something useless to respond with?
lx45803@reddit
This is reddit
levodelellis@reddit (OP)
The guy is correct. When you roll up loops there are fewer instructions. The cache is tiny, so it appears the game would constantly move instructions in and out of it.
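As a rough illustration (hypothetical code; on MIPS every instruction is 4 bytes, so the unrolled body is simply more bytes that have to be pulled over the bus into the small instruction cache):

```c
/* Hand-unrolled: fewer branches, but ~4x the loop-body instructions. */
void add_unrolled(int *dst, const int *src, int n)   /* assumes n % 4 == 0 */
{
    for (int i = 0; i < n; i += 4) {
        dst[i + 0] += src[i + 0];
        dst[i + 1] += src[i + 1];
        dst[i + 2] += src[i + 2];
        dst[i + 3] += src[i + 3];
    }
}

/* Rolled back up: same work, a fraction of the code bytes. */
void add_rolled(int *dst, const int *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] += src[i];
}
```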
gingingingingy@reddit
The instructions still have to be stored somewhere as code which is going to take up space/memory in cache.
player2@reddit
The instructions need to be transferred from cartridge ROM to main memory before the CPU can access them.
UncleMeat11@reddit
It's not a completely uncommon technique in the broader compilers space.
uCodeSherpa@reddit
For the record, it isn’t just N64. On modern hardware, just recalculating things is frequently faster than caching them.
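A hedged sketch of that tradeoff (hypothetical functions; which one wins depends entirely on whether the table stays in cache):

```c
/* Lookup-table version: one memory load per call, which may miss cache. */
static float sin_table[4096];
float sin_lut(float angle)   /* angle in [0, 2*pi), table assumed pre-filled */
{
    int idx = (int)(angle * (4096.0f / 6.2831853f)) & 4095;
    return sin_table[idx];
}

/* Recompute version: a few multiplies and adds, zero memory traffic.
 * (Truncated Taylor series, illustrative only, not a production approximation.) */
float sin_poly(float x)      /* x in [-pi, pi] */
{
    float x2 = x * x;
    return x * (1.0f - x2 * (1.0f / 6.0f - x2 * (1.0f / 120.0f)));
}
```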
aanzeijar@reddit
That's also true for modern CPUs, if anyone wonders. It used to be that you would unroll loops to save the loop overhead cycles. Nowadays, though, memory is so much slower than CPUs that loading less code can be faster than saving a few cycles.
levodelellis@reddit (OP)
For context: back then people were programming SNES games in assembly (Mario 64 was the first N64 game). People wrote 'optimizations' by hand, since that's what you did when you wrote assembly. C optimizers were somewhat buggy so they weren't used, hence the devs did some by hand
vinciblechunk@reddit
The MIPS CPU in the N64 had an extremely mature compiler ecosystem thanks to the SGI pedigree while the 65c816 core in the SNES was an absolute bitch and a half
levodelellis@reddit (OP)
Did you program for the 65c816? Someone (outside of reddit) linked me to this. Maybe optimizations weren't used because they didn't use SGI workstations and used the GCC compiler, which wasn't as trustworthy?
https://old.reddit.com/r/gamedev/comments/8wf7e0/what_were_ps1_and_n64_games_written_in/e1voug9/
vinciblechunk@reddit
If you're truly trying to get to the bottom of why SM64 shipped without compiler optimizations, you might get some insights from the people involved in the decompilation project.
vinciblechunk@reddit
65c816 not professionally, have dabbled.
Most of what that guy is saying in that comment tracks. GCC prior to 3.0 was pretty rudimentary and bugs in the optimizer were probably not out of the realm of possibility. N64 being a MIPS target, you did have a choice of several different compilers. I don't know a lot about the SN Systems GCC fork other than that it existed.
levodelellis@reddit (OP)
Oh? Any idea why they didn't turn on optimizations?
vinciblechunk@reddit
Speculating, but it's easy to invoke undefined behavior in C that happens to work at -O0 but breaks at -O2, and if you're a game dev team on a tight deadline, shipping it at -O0 is an easy fix to make the boss happy. Just ask Skyrim's devs
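The classic example of that kind of landmine (not from SM64's code, just the textbook case):

```c
/* This "overflow check" relies on signed wraparound, which is undefined
 * behavior in C.  At -O0 it usually behaves the way the author expected;
 * at -O2 the compiler may assume x + 1 never overflows and fold the whole
 * function to "return 0". */
int will_overflow(int x)
{
    return x + 1 < x;
}
```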
player2@reddit
Would be hilarious if
-O2 -fno-fast-math
would have worked
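For anyone wondering what the fast-math family of flags can actually break, the usual suspect is anything that relies on strict IEEE semantics (hypothetical snippet, nothing to do with SM64):

```c
/* -ffast-math implies -ffinite-math-only, so the compiler may assume NaN
 * never occurs and fold this self-comparison to "always false", silently
 * breaking the NaN check. */
int is_nan(double x)
{
    return x != x;   /* true only for NaN under strict IEEE 754 rules */
}
```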
genpfault@reddit
They should have enabled the fun & safe math optimizations using
-funsafe-math-optimizations!
happyscrappy@reddit
Or just the MIPS pedigree. Part of their design philosophy was to take the sophistication out of the hardware and make a good optimizing compiler.
Even more so than RISC in general (SPARC, AMD29K etc.) they did this. And this was in the 32-bit days before the R4400 even came along.
vinciblechunk@reddit
MIPS is kind of the ultimate "do more with less" ISA
vytah@reddit
Also, the SNES came from the era of fast memory: the CPU didn't have any cache, so every instruction always took the same amount of time. On such architectures, inlining and unrolling eliminate jumps and calls, leading to faster code.
In the case of the MIPS CPU used in the N64, the problem was that the CPU was faster than memory, so it had to have a cache: code was faster if it could fit in cache, so inlining and unrolling often became bad, like the video says, blowing past the cache size limits.
Then we got CPUs with bigger caches and deeper pipelines, but still no branch prediction, and inlining and unrolling became very useful again.
And nowadays we have CPUs with branch prediction, which means inlining and unrolling are still good, just not as much as they used to be.
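These days you mostly let the compiler pick, and GCC (8+) even exposes a per-loop knob; recent Clang accepts the same pragma, I believe. A small sketch:

```c
void saxpy(float *y, const float *x, float a, int n)
{
    /* Cap unrolling at 4 for this loop; "#pragma GCC unroll 1" would forbid
     * unrolling entirely and keep the code small - the N64-style tradeoff. */
    #pragma GCC unroll 4
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```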
ShinyHappyREM@reddit
> The SNES came from the era of fast memory: the CPU didn't have any cache, so every instruction always took the same amount of time

Ironically a ROM access could actually be faster (6 cycles) than a RAM access (8 cycles) [0]. The exception was the scratchpad RAM on the CPU die for the DMA registers [1], which were also in the address space.

> And nowadays we have CPUs with branch prediction, which means inlining and unrolling are still good, just not as much as they used to be

Because the code is translated from CISC to RISC and stored in the instruction cache, so inlining and unrolling might fill it up too much. It really depends on the workload and can change just by adding another line of code somewhere.
player2@reddit
This sounds x86-specific, and sounds like an assertion that the CPU actually caches microcode. Is that actually the case?
ShinyHappyREM@reddit
Yeah, it's called the µOP cache (micro-operation, not microcode).
I don't know much about current ARM or RISC-V CPUs; they might just use long instruction words where certain bit patterns encode the operation and parameters, and the instruction cache is only for storing the unmodified program code. Itanium (discontinued 4 years ago) might have been the same.
WJMazepas@reddit
Those optimizations weren't going to be made by a compiler. They are optimizations that every game does these days.
The thing is, the N64 was an imbalanced console that needed different optimizations than a modern PC from the time would need
Remarkable_Log_3260@reddit
New reaction image
mrbuttsavage@reddit
It's kind of amazing people are still dissecting a nearly 30 year old piece of software co-developed with new hardware and tooling and almost surely a very aggressive timeline.
Additional-Bee1379@reddit
It's not surprising that one of the first games written for the N64 wasn't optimized as much as it could be, but it's still cool to see how much can be squeezed out of hardware that old. It also gave me more insight into how later games on the platform managed to have better graphics despite having the same hardware.
dylan_1992@reddit
People do the same with art, architecture, etc.
ZackyZack@reddit
That's honestly a really cool perspective
joe-knows-nothing@reddit
This guy's YouTube channel is amazing. His dedication to Mario 64 and the N64 platform as a whole is really impressive. It's fun to watch and remember how good we have it now.
mr_birkenblatt@reddit
He's also working on a Mario 64 engine-based game