BlueGoliath@reddit
TL;DW: The N64 is extremely memory-bandwidth starved, so undoing optimizations that trade bandwidth for fewer CPU cycles tends to net incremental performance boosts.
CrazyHardFit@reddit
This article makes no sense. Literally, you can't have an optimization that makes a game slower. This is, literally, the opposite of an optimization. A code change that slows the game down can be called many things, but 'optimization' is not one of them.
neuralbeans@reddit
That's still an optimisation though. Just optimising data transfer instead of CPU.
levodelellis@reddit (OP)
I would call that going towards baseline so I don't think it should be considered an optimization
ProbsNotManBearPig@reddit
Optimization requires defining quantitative criteria for evaluation. It’s not subjective. Anything that improves fps is an optimization, and you can’t just move the goalposts because the way to get there is simplicity. Formally, what is described in the OP is an optimization.
levodelellis@reddit (OP)
If you said this to me in a job setting I'd fire you
Optimizations can and often do slow other parts of the code down
Brayneeah@reddit
What you describe is actually not an uncommon optimisation that compilers make! (if they can verify that doing so won't change a program's results)
levodelellis@reddit (OP)
Compilers don't do that. Unless you ignored the thread and think I'm talking about dead code optimization like that other guy. Don't make up stuff
ehaliewicz@reddit
I just wrote a simple compiler that rolls up loops like this just for fun 🤷
levodelellis@reddit (OP)
What compiler/language is that? If you like optimizations maybe you're interested in looking at SLP. I think that's what replaced reroll https://gcc.gnu.org/projects/tree-ssa/vectorization.html#slp
levodelellis@reddit (OP)
I imagine all the implementations would unroll again? It seems like reroll was removed https://github.com/llvm/llvm-project/pull/80972
ehaliewicz@reddit
If it was done specifically to reduce code size, I don't see why they would.
levodelellis@reddit (OP)
Maybe, but I heard it was to reroll for BLAS
ehaliewicz@reddit
https://reviews.llvm.org/D1940 Yes, specifically to reduce code size, and sometimes to allow the compiler to choose how much unrolling is appropriate for the hardware, including zero times.
levodelellis@reddit (OP)
There's also SLP but IIRC that's more like instruction combining than rolling up the loop
ehaliewicz@reddit
Any semantics preserving transformation that improves performance can be called an optimization and disagreeing seems pretty silly.
ehaliewicz@reddit
Why does it matter what you call it? Are you an authority on the definition or something?
LookIPickedAUsername@reddit
I don’t care what you call it, but I do care what compiler experts call it. And they call it an optimization. It’s called “dead code removal” and is a standard optimization pass in all modern compilers.
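For anyone unfamiliar with the pass being named here, a minimal illustration of what compilers mean by dead code elimination (the snippet is illustrative, not from the video or either commenter):

```c
int f(int x) {
    int unused = x * 37;   /* result never read: the multiply is removed  */
    if (0) {               /* branch can never be taken: the body is gone */
        x += 100;
    }
    return x + 1;          /* only this line survives dead code removal   */
}
```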
levodelellis@reddit (OP)
I am a compiler expert... and that's not dead code
LookIPickedAUsername@reddit
I was just responding to what you said, which was that you’d never call removing code an optimization. Maybe you meant “functional code”, but that’s not what you said.
levodelellis@reddit (OP)
So do you in fact care about what I call it and what I say?
I clearly said deleting code and I said "what the video talks about"
levodelellis@reddit (OP)
I know this is a big ask but reddit, quit being stupid. Check my submission history.
Jarpunter@reddit
Removing functional code is literally not dead code removal
LookIPickedAUsername@reddit
Where did I suggest otherwise?
Jarpunter@reddit
I mean that’s the entire premise of this conversation unless you just did that redditor thing where you take one statement in a total vacuum without any consideration for its context in order to make an “erm acktually” comment that nobody appreciates.
castthisaway5839@reddit
Obviously anything that hurts performance isn't an "optimization" by definition. And anything that achieves the same original goal while improving performance is.
So clearly when Kaze's saying "optimizations made Mario 64 SLOWER", it's implied shorthand for "code added in an *attempt* to optimize, and would have been optimizations under other circumstances, actually hurt performance."
Everyone knows by definition a pedantic, literal interpretation doesn't make sense. And everyone can understand what Kaze is communicating, and the tradeoff is that it works well as a terse, eye-catching title (without being misleading).
So it's kind of dumb that someone ITT is "well actually"-ing the obvious. OP also isn't phrasing things very well, but what he's spiritually trying to say is: "saying you're 'optimizing' by just `git revert`ing the failed experiment someone put in is like saying you're 'cooking' by picking out the olives someone put in your pasta."
Like... yes, you could certainly say that, but the point is getting lost in the pedantry. The main point was that it's super interesting that the Mario 64 code contains a number of seemingly tacked-on, failed experiments that can effectively be naively `git revert`ed back to a better-performing baseline.
This work by Kaze is very qualitatively different than the other efforts he's put into optimization in that it almost feels like "anti-work", and "well actually"-ing a point I think literally everyone in this thread actually already understands is just noise, mostly.
KingJeff314@reddit
"Attempted optimizations" if you want to be pedantic. The point is, they had functional code, then tried to rewrite it to squeeze performance out, but it was counterproductive to their goal
mr_birkenblatt@reddit
Yeah, but the developers put the CPU-focused optimizations in themselves. This video removes that extra code, which results in a speedup
Mynameismikek@reddit
A big part of me questions whether this was done at the end of dev, or way too early. Some of these felt to me like SNES-era "we just do it that way" optimisations rather than something that came from actually having hardware in front of you.
ShinyHappyREM@reddit
IIRC back in the day it was reported in some magazines that the game developers had to simulate the console on their SGI workstations before they had access to real ones.
Mynameismikek@reddit
Yeah - the early dev hardware was a massive SGI Onyx. You'd see them trotted out for marketing gigs - "you'll get this $100k workstation in a games console!" Later kits ran on the much smaller Indy or even a PC.
That's kinda why I think this was done early on. The Onyx was a bandwidth beast, but they also knew the CPU and graphics hardware was much more capable than what would end up in the final console, so they optimised for what they knew would get reduced.
mortaneous@reddit
Man, I miss SGI, some of their hardware was damaged sexy back in the day.
Murky-Relation481@reddit
You could say that damage made them do some RISC-y things.
Jamie_1318@reddit
The game was decompiled, the original source isn't available. There isn't a way to tell which the developers wrote, and which the optimizer added.
mr_birkenblatt@reddit
The devs compiled in debug mode. The compiler didn't add any optimizations
double-you@reddit
Is somebody using "optimization" wrong somewhere, or what claim are you attempting to correct here?
neuralbeans@reddit
I guess the issue is that you can't say that the Mario64 developers were doing optimisations if the system became slower.
falconfetus8@reddit
They thought they were making it faster, at the very least. It's an attempted optimization.
JaggedMetalOs@reddit
I mean it's actually optimizing for the hardware vs the standard programming optimizations you learn in CS.
sammymammy2@reddit
No. The data transfer is so slow that the CPU is stalled while waiting for instructions.
levodelellis@reddit (OP)
Outline (the opposite of inline) blew my mind
player2@reddit
Still a very relevant (de-)optimization today. If you have a loop with a condition that is not usually taken, outlining the not-taken branch might help the hot path fit into a single cache line. If the branch predictor can correctly predict the cold path isn’t taken, it won’t prefetch those instructions and your loop will execute entirely out of L1 instruction cache.
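A hedged sketch of that pattern in C (names are hypothetical; `noinline`/`cold` are GCC/Clang extensions, and whether it actually helps depends on the real code layout):

```c
#include <stdio.h>
#include <stdlib.h>

/* Cold error path outlined into its own function, kept out of the loop body. */
__attribute__((noinline, cold))
static void handle_overflow(int i)
{
    fprintf(stderr, "overflow at element %d\n", i);
    abort();
}

void accumulate(const int *v, int n, long *sum)
{
    for (int i = 0; i < n; i++) {
        if (v[i] > 1000000)       /* rarely taken */
            handle_overflow(i);   /* hot loop stays small and cache-resident */
        *sum += v[i];
    }
}
```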
kesawulf@reddit
You're a compiler expert and outlining blew your mind?
BlueGoliath@reddit
I'm not entirely sure how these save bandwidth, especially with something like loop rolling.
gingingingingy@reddit
It's more like the code takes up less space, so less bandwidth has to be used moving it into cache
BlueGoliath@reddit
What "code"? It's instructions.
artofthenunchaku@reddit
Inlining code leads to more instructions in the binary overall, while improving performance by reducing the instructions for an individual function call (there's more to it, but this is the relevant part). It's a tradeoff between CPU performance and memory usage.
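Roughly what that tradeoff looks like at the source level (a toy example, not from the game):

```c
/* If clampf() gets inlined, every call site grows by a copy of its body:
 * more instructions overall, but no call/return overhead per element.
 * If it stays an out-of-line function, each use is just a short call. */
static inline float clampf(float x, float lo, float hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

void clamp_all(float *v, int n)
{
    for (int i = 0; i < n; i++)
        v[i] = clampf(v[i], 0.0f, 1.0f);   /* body likely duplicated here */
}
```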
BlueGoliath@reddit
I'm aware but people are referring to two different things as if they were the same. They aren't.
artofthenunchaku@reddit
Are we really arguing the semantics of "code" vs "instructions"?
Good Lord
BlueGoliath@reddit
It isn't semantics. The user-facing code model and the actual instructions are likely to be different, especially when optimizations come into play.
glacialthinker@reddit
Your overly picky distinction was confusing to me, leading me to follow this subthread to dispel my confusion... because I grew up with code being various kinds of assembler mnemonics, which were 1:1 mappings to instructions. That is, I had no problem understanding what they meant by use of the word "code", even though for you it might imply a higher level language.
BlueGoliath@reddit
At no point was assembler mnemonics shown in the video. It was all clearly C code. Watch the video or get your eyes checked.
stylist-trend@reddit
You are absolutely insufferable.
Instead of just mentally substituting "code" with "instructions" and moving on with your life, you have to put a ton of effort into this giant tirade to prove your... intellectual superiority?
Or is it that you had nothing to add about more instructions == more bandwidth, and you just needed something useless to respond with?
lx45803@reddit
This is reddit
levodelellis@reddit (OP)
The guy is correct. When you roll up loops there are fewer instructions. The cache is tiny, so it appears the game would constantly move instructions in and out of it.
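As a rough illustration (hypothetical code; on MIPS every instruction is 4 bytes, so the unrolled body is simply more bytes that have to be pulled over the bus into the small instruction cache):

```c
/* Hand-unrolled: fewer branches, but ~4x the loop-body instructions. */
void add_unrolled(int *dst, const int *src, int n)   /* assumes n % 4 == 0 */
{
    for (int i = 0; i < n; i += 4) {
        dst[i + 0] += src[i + 0];
        dst[i + 1] += src[i + 1];
        dst[i + 2] += src[i + 2];
        dst[i + 3] += src[i + 3];
    }
}

/* Rolled back up: same work, a fraction of the code bytes. */
void add_rolled(int *dst, const int *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] += src[i];
}
```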
gingingingingy@reddit
The instructions still have to be stored somewhere as code which is going to take up space/memory in cache.
player2@reddit
The instructions need to be transferred from cartridge ROM to main memory before the CPU can access them.
UncleMeat11@reddit
It's not a completely uncommon technique in the broader compilers space.
uCodeSherpa@reddit
For the record, it isn’t just N64. On modern hardware, just recalculating things is frequently faster than caching them.
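A hedged sketch of that tradeoff (hypothetical functions; which one wins depends entirely on whether the table stays in cache):

```c
/* Lookup-table version: one memory load per call, which may miss cache. */
static float sin_table[4096];
float sin_lut(float angle)   /* angle in [0, 2*pi), table assumed pre-filled */
{
    int idx = (int)(angle * (4096.0f / 6.2831853f)) & 4095;
    return sin_table[idx];
}

/* Recompute version: a few multiplies and adds, zero memory traffic.
 * (Truncated Taylor series, illustrative only, not a production approximation.) */
float sin_poly(float x)      /* x in [-pi, pi] */
{
    float x2 = x * x;
    return x * (1.0f - x2 * (1.0f / 6.0f - x2 * (1.0f / 120.0f)));
}
```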
aanzeijar@reddit
That's also true for modern CPUs, if anyone wonders. It used to be that you would unroll loops to save the loop overhead cycles. Nowadays, though, memory is so much slower than CPUs that loading less code can be faster than saving a few cycles.
levodelellis@reddit (OP)
For context: back then people were programming SNES games in assembly (Mario 64 was the first N64 game). People wrote 'optimizations' by hand, since that's what you did when you wrote assembly. C optimizers were somewhat buggy so they weren't used, hence the devs did some by hand
vinciblechunk@reddit
The MIPS CPU in the N64 had an extremely mature compiler ecosystem thanks to the SGI pedigree while the 65c816 core in the SNES was an absolute bitch and a half
levodelellis@reddit (OP)
Did you program for the 65c816? Someone (outside of reddit) linked me to this. Maybe optimizations weren't used because they didn't use SGI workstations and used the GCC compiler, which wasn't as trustworthy?
https://old.reddit.com/r/gamedev/comments/8wf7e0/what_were_ps1_and_n64_games_written_in/e1voug9/
vinciblechunk@reddit
If you're truly trying to get to the bottom of why SM64 shipped without compiler optimizations, you might get some insights from the people involved in the decompilation project.
vinciblechunk@reddit
65c816 not professionally, have dabbled.
Most of what that guy is saying in that comment tracks. GCC prior to 3.0 was pretty rudimentary and bugs in the optimizer were probably not out of the realm of possibility. N64 being a MIPS target, you did have a choice of several different compilers. I don't know a lot about the SN Systems GCC fork other than that it existed.
levodelellis@reddit (OP)
Oh? Any idea why they didn't turn on optimizations?
vinciblechunk@reddit
Speculating, but it's easy to invoke undefined behavior in C that happens to work at -O0 but breaks at -O2, and if you're a game dev team on a tight deadline, shipping it at -O0 is an easy fix to make the boss happy. Just ask Skyrim's devs
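The classic example of that kind of landmine (not from SM64's code, just the textbook case):

```c
/* This "overflow check" relies on signed wraparound, which is undefined
 * behavior in C.  At -O0 it usually behaves the way the author expected;
 * at -O2 the compiler may assume x + 1 never overflows and fold the whole
 * function to "return 0". */
int will_overflow(int x)
{
    return x + 1 < x;
}
```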
player2@reddit
Would be hilarious if
-O2 -fno-fast-math
would have worked
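For anyone wondering what the fast-math family of flags can actually break, the usual suspect is anything that relies on strict IEEE semantics (hypothetical snippet, nothing to do with SM64):

```c
/* -ffast-math implies -ffinite-math-only, so the compiler may assume NaN
 * never occurs and fold this self-comparison to "always false", silently
 * breaking the NaN check. */
int is_nan(double x)
{
    return x != x;   /* true only for NaN under strict IEEE 754 rules */
}
```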
genpfault@reddit
They should have enabled the fun & safe math optimizations using
-funsafe-math-optimizations!
happyscrappy@reddit
Or just the MIPS pedigree. Part of their design philosophy was to take the sophistication out of the hardware and make a good optimizing compiler.
Even more so than RISC in general (SPARC, AMD29K etc.) they did this. And this was in the 32-bit days before the R4400 even came along.
vinciblechunk@reddit
MIPS is kind of the ultimate "do more with less" ISA
vytah@reddit
Also, the SNES came from the era of fast memory: the CPU didn't have any cache, so every instruction always took the same amount of time. On such architectures, inlining and unrolling eliminate jumps and calls, leading to faster code.
In the case of the MIPS CPU used in the N64, the problem was that the CPU was faster than memory, so it had to have a cache: code was faster if it could fit in cache, so inlining and unrolling often became bad, like the video says, blowing past the cache size limits.
Then we got CPUs with bigger caches and deeper pipelines, but still no branch prediction, and inlining and unrolling became very useful again.
And nowadays we have CPUs with branch prediction, which means inlining and unrolling are still good, just not as much as they used to be.
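These days you mostly let the compiler pick, and GCC (8+) even exposes a per-loop knob; recent Clang accepts the same pragma, I believe. A small sketch:

```c
void saxpy(float *y, const float *x, float a, int n)
{
    /* Cap unrolling at 4 for this loop; "#pragma GCC unroll 1" would forbid
     * unrolling entirely and keep the code small - the N64-style tradeoff. */
    #pragma GCC unroll 4
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```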
ShinyHappyREM@reddit
> The SNES came from the era of fast memory: the CPU didn't have any cache, so every instruction always took the same amount of time

Ironically a ROM access could actually be faster (6 cycles) than a RAM access (8 cycles) [0]. The exception was the scratchpad RAM on the CPU die for the DMA registers [1], which were also in the address space.

> And nowadays we have CPUs with branch prediction, which means inlining and unrolling are still good, just not as much as they used to be

Because the code is translated from CISC to RISC and stored in the instruction cache, so inlining and unrolling might fill it up too much. It really depends on the workload and can change just by adding another line of code somewhere.
player2@reddit
This sounds x86-specific, and sounds like an assertion that the CPU actually caches microcode. Is that actually the case?
ShinyHappyREM@reddit
Yeah, it's called the µOP cache (micro-operation, not microcode).
I don't know much about current ARM or RISC-V CPUs; they might just use long instruction words where certain bit patterns encode the operation and parameters, and the instruction cache is only for storing the unmodified program code. Itanium (discontinued 4 years ago) might have been the same.
WJMazepas@reddit
Those optimizations weren't going to be made by a compiler. They are optimizations that every game does these days.
The thing is, the N64 was an imbalanced console that needed different optimizations than a modern PC from the time would need
Remarkable_Log_3260@reddit
New reaction image
mrbuttsavage@reddit
It's kind of amazing people are still dissecting a nearly 30 year old piece of software co-developed with new hardware and tooling and almost surely a very aggressive timeline.
Additional-Bee1379@reddit
It's not surprising that one of the first games written for the N64 wasn't optimized as much as it could be, but it's still cool to see how much can be squeezed out of hardware that old. It also gave me more insight into how later games on the platform managed to have better graphics despite having the same hardware.
dylan_1992@reddit
People do the same with art, architecture, etc.
ZackyZack@reddit
That's honestly a really cool perspective
joe-knows-nothing@reddit
This guy's YouTube channel is amazing. His dedication to Mario 64 and the N64 platform as a whole is really impressive. It's fun to watch and remember how good we have it now.
mr_birkenblatt@reddit
He's also working on a Mario 64 engine-based game