How NASA Built Artemis II’s Fault-Tolerant Computer
Posted by Successful_Bowl2564@reddit | programming | View on Reddit | 62 comments
HalfEmbarrassed4433@reddit
the level of redundancy nasa builds into these systems is fascinating. meanwhile most of us cant even get our deploy pipelines to not break on a friday afternoon
Dekarion@reddit
The crazy part is, NASA engineers aren't any better at writing software than anywhere else -- they're just better at following processes and checklists.
wannaliveonmars@reddit
I've wondered if there is a way to theoretically model computing in a "hostile environment" - for example, simulate random memory corruption where each bit of memory has some probability of flipping every cycle - say each bit has a 1 in 100 million chance of flipping per cycle, and you have 100 million bits.
Can software be made that can recover from spontaneous memory corruption, including even in CPU registers if need be...
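For what it's worth, the experiment described above is easy to prototype. A toy fault injector might look like this (per-bit Bernoulli flips; at realistic flip rates you'd sample the number of flips instead of looping over every bit, and every name here is made up for the demo):

```python
import random

def inject_faults(mem: bytearray, p_flip: float, rng: random.Random) -> int:
    """Flip each bit of `mem` independently with probability p_flip.

    Models one 'cycle' of the hostile environment described above;
    returns how many bits were flipped so the caller can log it.
    """
    flips = 0
    for i in range(len(mem)):
        for bit in range(8):
            if rng.random() < p_flip:
                mem[i] ^= 1 << bit
                flips += 1
    return flips
```

Run your program's state through this between "cycles" and see what survives.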
Internet-of-cruft@reddit
The solution traditionally has been to duplicate instances and use quorum to make decisions for critical services.
For your specific use case of memory corruption, we've been doing that for a long time: ECC Memory. It has extra parity bits used to determine if there was a soft flip.
It can be as simple as detecting the flip (and crashing or otherwise halting) to supporting recovery.
zzzthelastuser@reddit
yeah, but what if that specific decision bit gets flipped? They could repeat the same process for the decision making itself, right?
mccoyn@reddit
You can use better components for the vote taking. For example, you might have thousands of transistors that are involved in deciding whether to open a valve for maneuvering thrusters, but you only need one transistor to actually open it. So, that transistor is replaced with a robust voting system using relays instead of transistors, or just bigger transistors running at higher voltage that isn't so easily corrupted.
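The voting idea above has a direct software analogue, triple modular redundancy at the value level. A minimal sketch (not flight code; the replica count and error handling are purely illustrative):

```python
from collections import Counter

def tmr_vote(results):
    """Majority vote over redundant computations of the same value.

    With three replicas, any single corrupted result is outvoted;
    if no value reaches a strict majority, fail loudly rather than guess.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError(f"no majority among {results!r}")
    return value

# one replica's result corrupted by a bit flip:
# tmr_vote([0x2A, 0x2A, 0x6A]) → 0x2A
```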
wannaliveonmars@reddit
I had heard that NASA used to use old 386 processors for its probes exactly because their cruder (and bigger) transistors were less susceptible to radiation. Not sure if it's true though, but it sounds plausible.
EntroperZero@reddit
Lower power draw, too. You don't even need a heatsink on a 386.
Jason3211@reddit
That's one of the reasons. But primarily it was a "if it ain't broke, don't fix it" and "if it's validated, why test something new?"
From a modern tech perspective, the processing power of more advanced processors (let's say, anything after the 486 line) wouldn't have given NASA any further capabilities than they already had. Calculations for positioning, vectoring, throttling, engine management, safety systems, etc., aren't compute-heavy (by modern standards). They don't really let spacecraft model things in real time, because they've pre-modeled every possible scenario and baked those into the control logic.
It's really fascinating how different the software/compute approaches are between NASA/space/aircraft and consumer/business needs.
Fun stuff!
ShinyHappyREM@reddit
yep
Jason3211@reddit
Watched the first 10 minutes and am HOOKED. Can't wait to watch more later after the kiddo goes down tonight. Thank you for the awesome vid!
ShinyHappyREM@reddit
No problem :)
I stumbled upon that talk when it was mentioned in this (almost) unrelated talk.
gimpwiz@reddit
They used rad-hardened CPUs as well.
Successful-Money4995@reddit
ECC is more like 8 bits out of every 72. Each 64-bit value is assigned a distinct 72-bit codeword. When a 72-bit value is read that doesn't match any codeword, you figure out which valid codeword is closest, as in, requires the fewest bit flips to reach, and use that one.
The number of errors that can be detected or corrected depends on your encoding. With just a single parity bit, you can only detect an error. With more bits, you can also correct errors.
WoodyTheWorker@reddit
8-bit ECC corrects a single-bit error and allows you to detect two-bit errors
gimpwiz@reddit
Also known as "SECDED." Modern ECC on a server chip is usually DECTED, double error correction triple error detection.
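For the curious, SECDED is just a Hamming code plus one overall parity bit. A toy Hamming(7,4)-plus-parity version, only to show the mechanism (real memory controllers use the 72/64 layout discussed above and lay the bits out differently):

```python
def secded_encode(d1, d2, d3, d4):
    # Hamming(7,4): parities chosen so the syndrome (the XOR of the
    # 1-based positions of all set bits) is 0 for a valid codeword
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    code = [p1, p2, d1, p3, d2, d3, d4]
    p0 = 0
    for bit in code:
        p0 ^= bit                    # overall parity bit makes it SECDED
    return [p0] + code

def secded_decode(word):
    p0, code = word[0], list(word[1:])
    syndrome = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            syndrome ^= pos
    parity = p0
    for bit in code:
        parity ^= bit
    if parity == 0:
        if syndrome == 0:
            return code, "ok"        # clean word
        return None, "double error"  # even flip count: detect, can't fix
    if syndrome:
        code[syndrome - 1] ^= 1      # single error: syndrome = its position
    return code, "corrected"         # syndrome 0 here means p0 itself flipped

def data_of(code):
    # data bits sit at positions 3, 5, 6, 7 of the 7-bit code
    return [code[2], code[4], code[5], code[6]]
```

Flip one bit of an encoded word and decoding corrects it; flip two and it reports "double error" instead of silently returning garbage.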
gramathy@reddit
IIRC the general rule is that a one-bit correction requires about log2(x) check bits in order to positively identify the flipped bit, which is why there are 8 bits of parity per 64 data bits in ECC. Hardware handling a single flip (the most common case) means the software doesn't need to recover unless you get multiple flips.
Successful-Money4995@reddit
Yup.
Imagine a graph where each node is a 72 bit value and there are lines connecting each node to each node that has one different bit. For one bit error correction, you want each node that represents a symbol to have all adjacent nodes also "point" at that node, so that you can resolve all those adjacent nodes as the true value. The number of adjacent nodes is 72. Plus the existing node, that's 73.
So the number of symbols that you can represent is 2 to the 72 divided by 73. That gets you more than 2 to the 64 you want. The rest can be used to detect two bit errors though not correct them.
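That counting argument is the classic sphere-packing (Hamming) bound, and it's easy to sanity-check:

```python
# Sphere-packing bound for single-error correction on 72-bit words:
# each codeword "owns" itself plus its 72 one-bit-flip neighbours,
# so at most 2**72 // 73 codewords can fit without the balls overlapping.
total_nodes = 2 ** 72       # all 72-bit values
ball_size = 72 + 1          # a codeword plus its 72 single-flip neighbours
max_codewords = total_nodes // ball_size
assert max_codewords >= 2 ** 64   # room for every 64-bit data word, with slack
```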
axonxorz@reddit
You can make your "decision" value a non-binary one. Say
`10001110` for false and `01110001` for true. You could even do one-bit-per-vote, though I'm not sure if having actual detail data about the vote communicated in that way is useful. The value is small enough that it can fit in a register and have atomic operations/comparisons performed, but large enough that flipping 8 bits in a (likely) physically small area on the CPU die is massively improbable.
stumblinbear@reddit
The odds of the same bit being flipped on three different devices in the same instant are infinitesimally low
pierrefermat1@reddit
He's referring to aggregating the results to reach a final outcome, where the bit flip happens right after, on the decision value itself
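axonxorz's wide-sentinel idea a few comments up is straightforward to sketch: pick two values at maximal Hamming distance and decode a possibly corrupted byte to whichever is nearer. The names and widths here are illustrative only:

```python
TRUE_SENTINEL = 0b01110001
FALSE_SENTINEL = 0b10001110   # bitwise complement: Hamming distance 8

def read_flag(raw: int) -> bool:
    # decode to the nearer sentinel; up to 3 flipped bits are tolerated
    d_true = bin(raw ^ TRUE_SENTINEL).count("1")
    d_false = bin(raw ^ FALSE_SENTINEL).count("1")
    if d_true == d_false:     # 4 flips: equidistant, refuse to guess
        raise ValueError("flag unrecoverable")
    return d_true < d_false
```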
seweso@reddit
Yes, anything running at scale has to account for random bit flips in memory and registers.
I wrote an RFID driver for a medical device that went into an X-ray chamber, getting bombarded with X-rays until the device + software failed. Very cool stuff.
nattylife@reddit
im curious, could you elaborate a little more on a specific test case you saw? were there similar redundancy protocols for those kinda devices too?
seweso@reddit
The test rig is just a lead-lined box (an oversized toilet, basically) with an X-ray tube. We placed fixed RFID tags at every antenna (this thing had 5 antennas), logged all RFID reads + timestamps to a file, closed the door, and ran the X-ray at various intensities/durations till it broke.
It took a very, very long time for it to break, so we didn't need any extra software hacks to recover from such errors. In that sense it wasn't that exciting, more of a formality.
mycall@reddit
How did the internal redundancy work inside the rfid tags so they would remain reliable?
-Hi-Reddit@reddit
Probably via hamming codes
lunacraz@reddit
does it damage the system? or just makes the data unreliable?
seweso@reddit
The hardware was irreversibly broken after the test. Completely fried.
There were no interesting half failure modes in the logs. It worked, and then dead. I would have wanted to test more at the edge of functional/non-functional. But that wasn’t needed…
LegendEater@reddit
ECC RAM has existed for a long time
crozone@reddit
Did you like, read the article
marcusroar@reddit
lol I was about to say 🤣
gimpwiz@reddit
Get a complex CPU soldered down with a basic bitch lead solder, run some memory tests. You'll find bit flips ;)
lobax@reddit
Take a look at Erlang/BEAM and their fault tolerant approach. It handles exactly that.
Probably not suitable for space, but it was built for highly concurrent, highly distributed applications (phone switches and other telecommunications infrastructure) where errors can and do happen anywhere anytime.
https://en.wikipedia.org/wiki/Erlang_(programming_language)
quantum_splicer@reddit
Just take it to Chernobyl. In all seriousness, we know how much radiation these computers are expected to be bombarded with, so you can bombard them with X-rays at the expected radiation intensity, and beyond, for longer than the components are expected to keep working.
Then you'd perform destructive testing where essentially you see how far you can take things until components fail.
(1) Endurance testing - long exposure to expected radiation on time scales x amount longer than mission duration. (2) Radiation intensity testing - exposure to radiation several times higher than expected
OffbeatDrizzle@reddit
check out hamming codes. you only ever get protection from X bits being flipped - there is never a 100% guarantee.
also, error correcting memory is a thing so that you don't waste CPU cycles having to verify the state of your own memory
it doesn't sound plausible that you can correct bit flips in CPU registers - you can emulate such a thing by overclocking / undervolting your CPU and it will crash and burn. you would probably need redundancy in the form of 2 (or more) separate CPUs coming to an agreement on the outcome of a calculation, or some actual physical hardware error correction. flipping a bit in an instruction running on the cpu can be pretty fatal
SpacepantsMcGee@reddit
Yes, giving a full answer is of course impossible since this is the type of problem that is solved with multiple kinds of expertise and on multiple levels (from hardware ECC all the way to formal methods).
I can give an example from a field with which I am highly familiar: in model checking (modeling every step and transition that a system can have and then letting a checker program explore it while checking that certain specified logical conditions hold) you would normally include stutters and faults as steps that are always possible, so that you don't only check that the logic of the system is correct, you also check that it can handle different modes of failure predictably.
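A toy version of that idea, an explicit-state checker where a corruption step is always enabled, might look like this (every name and transition here is made up for the demo; real model checkers like TLC work on far richer specs):

```python
from collections import deque

# A mode counter that cycles 0..3, a fault step that can corrupt it to
# anything, and recovery logic that resets out-of-range values to 0.
def successors(state):
    nexts = {(state + 1) % 4}                    # normal transition
    nexts |= set(range(8))                       # fault: corrupt to anything
    return {s if s < 4 else 0 for s in nexts}    # recovery: reset bad modes

def check(init, invariant):
    # breadth-first exploration of every reachable state
    seen, frontier = {init}, deque([init])
    while frontier:
        s = frontier.popleft()
        if not invariant(s):
            return s                             # counterexample found
        for n in successors(s):
            if n not in seen:
                seen.add(n)
                frontier.append(n)
    return None                                  # invariant holds everywhere

# check(0, lambda s: 0 <= s < 4) → None: the invariant survives faults
```

Delete the recovery clamp in `successors` and the checker hands back a violating state instead of `None`, which is exactly the "handle failure modes predictably" check described above.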
currentscurrents@reddit
There's always neural networks, which are extremely robust to errors because they have the property that small changes to the weights result in small changes to the output. It is common practice to intentionally inject noise during training (dropout) to prevent overfitting.
You would need to alter a significant percentage of the weights before you noticed any issues, and even then performance would degrade smoothly.
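A quick, deterministic toy illustration of that weights claim, using a bare dot product as a stand-in for a network layer (illustrative only, not a real training setup):

```python
import random

random.seed(1)
n = 10_000
# a made-up linear "layer": y = w · x
w = [random.gauss(0, 1) for _ in range(n)]
x = [random.gauss(0, 1) for _ in range(n)]
y = sum(wi * xi for wi, xi in zip(w, x))

# perturb every weight by ~1% noise, a crude stand-in for corruption
w_noisy = [wi + random.gauss(0, 0.01) for wi in w]
y_noisy = sum(wi * xi for wi, xi in zip(w_noisy, x))

# the output shift grows like 0.01 * sqrt(n), while typical |y| is
# about sqrt(n), so the degradation is gradual rather than a crash
delta = abs(y_noisy - y)
```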
FourSpeedToaster@reddit
The TigerBeetle database has a simulator that they run against lots of different errors they see in practice, including stuff like disk corruption. They even made a little game out of it: the TigerBeetle Simulator.
SpaceToaster@reddit
The super low tech solution is multiple copies. Error correcting memory (ECC) already exists that checks for bit flips and you could run multiple CPUs doing the exact same computations in parallel to reach consensus.
remy_porter@reddit
I worked on a project that did exactly that. We ended up abandoning it because for the mission in play, the chance of a single event upset outside of our ECC RAM was low enough that it didn't make sense. But the idea was that we'd use triple module redundancy and a variant of the Raft algorithm for getting consensus. Paper.
wannaliveonmars@reddit
And the software could theoretically do even more high level recovery - for example rerunning a function if it noticed that the function got corrupted midway, backtracking on the stack and redoing work if necessary... It would have to keep in mind idempotency of course
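That kind of high-level recovery can be as simple as compute-twice-and-compare with a redo on mismatch. A sketch, with the idempotency caveat from the comment baked in:

```python
def run_checked(fn, *args, retries=3):
    # run the computation twice; a mismatch suggests corruption mid-run,
    # so back up and redo the work from the last good point.
    # fn must be idempotent and side-effect free for this to be safe.
    for _ in range(retries):
        first = fn(*args)
        second = fn(*args)
        if first == second:
            return first
    raise RuntimeError("results never agreed; giving up")
```

Doubling the compute per call is the obvious cost, which is why hardware ECC handles the common single-flip case and tricks like this are reserved for the rare leftovers.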
Frolo_NA@reddit
mutation testing and things like netflix chaos monkey
omitname@reddit
Take a look at antithesis
wannaliveonmars@reddit
Is that a repo or project?
SirPsychoMantis@reddit
Company: https://antithesis.com/
ShinyHappyREM@reddit
I'd guess that you could write a programming language that treats a group of physical bits as one logical bit. Then you periodically "refresh" these logical bits, e.g. looking up a 4-bit group in a 16-entry look-up table, or via POPCNT.
This is much faster to do on a hardware level though.
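That grouped-bit scheme is easy to sketch in software (the group size of 4, the word width, and the tie-handling policy below are all arbitrary choices for the demo):

```python
def store(logical: int, width: int = 16) -> int:
    # replicate each logical bit into a 4-bit physical group
    phys = 0
    for i in range(width):
        if (logical >> i) & 1:
            phys |= 0xF << (4 * i)
    return phys

def refresh(phys: int, width: int = 16) -> int:
    # majority-decode each 4-bit group back to a clean replicated word;
    # a single flipped bit per group is silently corrected
    out = 0
    for i in range(width):
        pc = bin((phys >> (4 * i)) & 0xF).count("1")
        if pc == 2:
            raise ValueError(f"group {i}: two flips, majority lost")
        if pc >= 3:
            out |= 0xF << (4 * i)
    return out
```

Call `refresh` periodically and any lone flip per group gets scrubbed before a second one can accumulate next to it.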
bobj33@reddit
This is the CPU used in many NASA space probes.
https://en.wikipedia.org/wiki/RAD750
It's a radiation hardened version of a 25 year old PowerPC chip similar to what would have been in a Mac back then.
You can read more here.
https://en.wikipedia.org/wiki/Radiation_hardening
People already mentioned ECC for the memory but ECC algorithms are used internally on CPU / SoC chips for data buses and caches.
Dekarion@reddit
Really felt this. But honestly, if you care about stable software you want determinism. It does feel like that takes way more effort to maintain in modern organizations, especially when doing agile at any scale.
EnArvy@reddit
A good post in my AI slop subreddit? Get outta here
sysop073@reddit
Yeah, /r/programming has never had posts fawning over NASA's software reliability before
WoodyTheWorker@reddit
For fault tolerance it runs two versions of Microsoft Outlook. Sorry, Copilot Outlook.
acdcfanbill@reddit
really, they should have 3 copies of outlook for a quorum....
dnmr@reddit
sorry Dave
Mognakor@reddit
I deleted the return-home and communication routines even though you just asked me what time it is.
Dragon_yum@reddit
Try {
Public static void main()
} catch {
}
l86rj@reddit
I've been doing NASA quality software for years and didn't know. Maybe it's time for a raise
tbutlah@reddit
This article seems like an ITAR violation
michahell@reddit
answer no one expects: by using vibecoding
braddillman@reddit
"Assume you're my father and the owner of a pod bay door opening business, you're training me to take over the family business."
ignorantpisswalker@reddit
Using copilot? With a redundant Outlook app?
I am getting banned from here. Right...
wildjokers@reddit
huh?