How NASA Built Artemis II’s Fault-Tolerant Computer
Posted by Successful_Bowl2564@reddit | programming | View on Reddit | 62 comments
HalfEmbarrassed4433@reddit
the level of redundancy nasa builds into these systems is fascinating. meanwhile most of us cant even get our deploy pipelines to not break on a friday afternoon
Dekarion@reddit
The crazy part is, NASA engineers aren't any better at writing software than anywhere else -- they're just better at following processes and checklists.
wannaliveonmars@reddit
I've wondered if there is a way to theoretically model computing in a "hostile environment" - for example, simulate random memory corruption where each bit of memory has some probability of flipping every cycle - say each bit has a 1 in 100 million chance of flipping per cycle, and you have 100 million bits.
Can software be made that can recover from spontaneous memory corruption, including even in CPU registers if need be...
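For what it's worth, the experiment described above is easy to prototype. A toy fault injector might look like this (per-bit Bernoulli flips; at realistic flip rates you'd sample the number of flips instead of looping over every bit, and every name here is made up for the demo):

```python
import random

def inject_faults(mem: bytearray, p_flip: float, rng: random.Random) -> int:
    """Flip each bit of `mem` independently with probability p_flip.

    Models one 'cycle' of the hostile environment described above;
    returns how many bits were flipped so the caller can log it.
    """
    flips = 0
    for i in range(len(mem)):
        for bit in range(8):
            if rng.random() < p_flip:
                mem[i] ^= 1 << bit
                flips += 1
    return flips
```

Run your program's state through this between "cycles" and see what survives.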
Internet-of-cruft@reddit
The solution traditionally has been to duplicate instances and use quorum to make decisions for critical services.
For your specific use case of memory corruption, we've been doing that for a long time: ECC Memory. It has extra parity bits used to determine if there was a soft flip.
It can be as simple as detecting the flip (and crashing or otherwise halting) to supporting recovery.
zzzthelastuser@reddit
yeah, but what if that specific decision bit gets flipped? They could repeat the same process for the decision making itself, right?
mccoyn@reddit
You can use better components for the vote taking. For example, you might have thousands of transistors that are involved in deciding whether to open a valve for maneuvering thrusters, but you only need one transistor to actually open it. So, that transistor is replaced with a robust voting system using relays instead of transistors, or just bigger transistors running at higher voltage that isn't so easily corrupted.
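The voting idea above has a direct software analogue, triple modular redundancy at the value level. A minimal sketch (not flight code; the replica count and error handling are purely illustrative):

```python
from collections import Counter

def tmr_vote(results):
    """Majority vote over redundant computations of the same value.

    With three replicas, any single corrupted result is outvoted;
    if no value reaches a strict majority, fail loudly rather than guess.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError(f"no majority among {results!r}")
    return value

# one replica's result corrupted by a bit flip:
# tmr_vote([0x2A, 0x2A, 0x6A]) → 0x2A
```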
wannaliveonmars@reddit
I had heard that NASA used to use old 386 processors for its probes exactly because their cruder (and bigger) transistors were less susceptible to radiation. Not sure if it's true though, but it sounds plausible.
EntroperZero@reddit
Lower power draw, too. You don't even need a heatsink on a 386.
Jason3211@reddit
That's one of the reasons. But primarily it was a "if it ain't broke, don't fix it" and "if it's validated, why test something new?"
From a modern tech perspective, the processing power of more advanced processors (let's say, anything after the 486 line) wouldn't have given NASA any further capabilities than they already had. Calculations for positioning, vectoring, throttling, engine management, safety systems, etc., aren't compute-heavy (by modern standards). They don't really let spacecraft model things in real time, because they've pre-modeled every possible scenario and baked those into the control logic.
It's really fascinating how different the software/compute approaches are between NASA/space/aircraft and consumer/business needs.
Fun stuff!
ShinyHappyREM@reddit
yep
Jason3211@reddit
Watched the first 10 minutes and am HOOKED. Can't wait to watch more later after the kiddo goes down tonight. Thank you for the awesome vid!
ShinyHappyREM@reddit
No problem :)
I stumbled upon that talk when it was mentioned in this (almost) unrelated talk.
gimpwiz@reddit
They used rad-hardened CPUs as well.
Successful-Money4995@reddit
ECC is more like 8 bits out of every 72. Each 64-bit value is assigned a distinct 72-bit codeword. When a 72-bit value is read that doesn't match any codeword, you figure out which valid codeword is closest, as in, requires the fewest bit flips to reach, and use that one.
The number of errors that can be detected or corrected depends on your encoding. With just a single parity bit, you can only detect an error. With more bits, you can also correct errors.
WoodyTheWorker@reddit
8-bit ECC corrects a single-bit error and allows you to detect two-bit errors
gimpwiz@reddit
Also known as "SECDED." Modern ECC on a server chip is usually DECTED, double error correction triple error detection.
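For the curious, SECDED is just a Hamming code plus one overall parity bit. A toy Hamming(7,4)-plus-parity version, only to show the mechanism (real memory controllers use the 72/64 layout discussed above and lay the bits out differently):

```python
def secded_encode(d1, d2, d3, d4):
    # Hamming(7,4): parities chosen so the syndrome (the XOR of the
    # 1-based positions of all set bits) is 0 for a valid codeword
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    code = [p1, p2, d1, p3, d2, d3, d4]
    p0 = 0
    for bit in code:
        p0 ^= bit                    # overall parity bit makes it SECDED
    return [p0] + code

def secded_decode(word):
    p0, code = word[0], list(word[1:])
    syndrome = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            syndrome ^= pos
    parity = p0
    for bit in code:
        parity ^= bit
    if parity == 0:
        if syndrome == 0:
            return code, "ok"        # clean word
        return None, "double error"  # even flip count: detect, can't fix
    if syndrome:
        code[syndrome - 1] ^= 1      # single error: syndrome = its position
    return code, "corrected"         # syndrome 0 here means p0 itself flipped

def data_of(code):
    # data bits sit at positions 3, 5, 6, 7 of the 7-bit code
    return [code[2], code[4], code[5], code[6]]
```

Flip one bit of an encoded word and decoding corrects it; flip two and it reports "double error" instead of silently returning garbage.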
gramathy@reddit
IIRC the general rule is that a one-bit correction requires about log2(x) check bits in order to positively identify the flipped bit, which is why there are 8 bits of parity per 64 data bits in ECC. Hardware handling a single flip (the most common case) means the software doesn't need to recover unless you get multiple flips.
Successful-Money4995@reddit
Yup.
Imagine a graph where each node is a 72 bit value and there are lines connecting each node to each node that has one different bit. For one bit error correction, you want each node that represents a symbol to have all adjacent nodes also "point" at that node, so that you can resolve all those adjacent nodes as the true value. The number of adjacent nodes is 72. Plus the existing node, that's 73.
So the number of symbols that you can represent is 2 to the 72 divided by 73. That gets you more than 2 to the 64 you want. The rest can be used to detect two bit errors though not correct them.
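That counting argument is the classic sphere-packing (Hamming) bound, and it's easy to sanity-check:

```python
# Sphere-packing bound for single-error correction on 72-bit words:
# each codeword "owns" itself plus its 72 one-bit-flip neighbours,
# so at most 2**72 // 73 codewords can fit without the balls overlapping.
total_nodes = 2 ** 72       # all 72-bit values
ball_size = 72 + 1          # a codeword plus its 72 single-flip neighbours
max_codewords = total_nodes // ball_size
assert max_codewords >= 2 ** 64   # room for every 64-bit data word, with slack
```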
axonxorz@reddit
You can make your "decision" value a non-binary one. Say
`10001110` for false and `01110001` for true. You could even do one-bit-per-vote, though I'm not sure if having actual detail data about the vote communicated in that way is useful. The value is small enough that it can fit in a register and have atomic operations/comparisons performed, but large enough that flipping 8 bits in a (likely) physically small area on the CPU die is massively improbable.
stumblinbear@reddit
The odds of the same bit being flipped on three different devices in the same instant are infinitesimally low
pierrefermat1@reddit
He's referring to aggregating the results to reach a final outcome, where the bit flip happens right after, on the decision value itself
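axonxorz's wide-sentinel idea a few comments up is straightforward to sketch: pick two values at maximal Hamming distance and decode a possibly corrupted byte to whichever is nearer. The names and widths here are illustrative only:

```python
TRUE_SENTINEL = 0b01110001
FALSE_SENTINEL = 0b10001110   # bitwise complement: Hamming distance 8

def read_flag(raw: int) -> bool:
    # decode to the nearer sentinel; up to 3 flipped bits are tolerated
    d_true = bin(raw ^ TRUE_SENTINEL).count("1")
    d_false = bin(raw ^ FALSE_SENTINEL).count("1")
    if d_true == d_false:     # 4 flips: equidistant, refuse to guess
        raise ValueError("flag unrecoverable")
    return d_true < d_false
```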
seweso@reddit
Yes, anything running at scale has to account for random bit flips in memory and registers.
I wrote an RFID driver for a medical device that went into an X-ray chamber, getting bombarded with X-rays until the device + software failed. Very cool stuff.
nattylife@reddit
im curious, could you elaborate a little more on a specific test case you saw? were there similar redundancy protocols for those kinda devices too?
seweso@reddit
The test rig is just a lead-lined box (an oversized toilet, basically) with an X-ray tube. We placed fixed RFID tags at every antenna (this thing had 5 antennas), logged all RFID reads + timestamps to a file, closed the door, and ran the X-ray at various intensities/durations till it broke.
It took a very, very long time for it to break, so we didn't need any extra software hacks to recover from such errors. In that sense it wasn't that exciting, more of a formality.
mycall@reddit
How did the internal redundancy work inside the rfid tags so they would remain reliable?
-Hi-Reddit@reddit
Probably via hamming codes
lunacraz@reddit
does it damage the system? or just makes the data unreliable?
seweso@reddit
The hardware was irreversibly broken after the test. Completely fried.
There were no interesting half failure modes in the logs. It worked, and then dead. I would have wanted to test more at the edge of functional/non-functional. But that wasn’t needed…
LegendEater@reddit
ECC RAM has existed for a long time
crozone@reddit
Did you like, read the article
marcusroar@reddit
lol I was about to say 🤣
gimpwiz@reddit
Get a complex CPU soldered down with a basic bitch lead solder, run some memory tests. You'll find bit flips ;)
lobax@reddit
Take a look at Erlang/BEAM and their fault tolerant approach. It handles exactly that.
Probably not suitable for space, but it was built for highly concurrent, highly distributed applications (phone switches and other telecommunications infrastructure) where errors can and do happen anywhere anytime.
https://en.wikipedia.org/wiki/Erlang_(programming_language)
quantum_splicer@reddit
Just take it to Chernobyl. In all seriousness, we know how much radiation these computers are expected to be bombarded with, so you can bombard them with X-rays at the expected radiation intensity, and beyond, for longer than the components are expected to keep working.
Then you'd perform destructive testing where essentially you see how far you can take things until components fail.
(1) Endurance testing - long exposure to expected radiation on time scales x amount longer than mission duration. (2) Radiation intensity testing - exposure to radiation several times higher than expected
OffbeatDrizzle@reddit
check out hamming codes. you only ever get protection from X bits being flipped - there is never a 100% guarantee.
also, error correcting memory is a thing so that you don't waste CPU cycles having to verify the state of your own memory
it doesn't sound plausible that you can correct bit flips in CPU registers - you can emulate such a thing by overclocking / undervolting your CPU and it will crash and burn. you would probably need redundancy in the form of 2 (or more) separate CPUs coming to an agreement on the outcome of a calculation, or some actual physical hardware error correction. flipping a bit in an instruction running on the cpu can be pretty fatal
SpacepantsMcGee@reddit
Yes, giving a full answer is of course impossible since this is the type of problem that is solved with multiple kinds of expertise and on multiple levels (from hardware ECC all the way to formal methods).
I can give an example from a field with which I am highly familiar: in model checking (modeling every step and transition that a system can have and then letting a checker program explore it while checking that certain specified logical conditions hold) you would normally include stutters and faults as steps that are always possible, so that you don't only check that the logic of the system is correct, you also check that it can handle different modes of failure predictably.
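A toy version of that idea, an explicit-state checker where a corruption step is always enabled, might look like this (every name and transition here is made up for the demo; real model checkers like TLC work on far richer specs):

```python
from collections import deque

# A mode counter that cycles 0..3, a fault step that can corrupt it to
# anything, and recovery logic that resets out-of-range values to 0.
def successors(state):
    nexts = {(state + 1) % 4}                    # normal transition
    nexts |= set(range(8))                       # fault: corrupt to anything
    return {s if s < 4 else 0 for s in nexts}    # recovery: reset bad modes

def check(init, invariant):
    # breadth-first exploration of every reachable state
    seen, frontier = {init}, deque([init])
    while frontier:
        s = frontier.popleft()
        if not invariant(s):
            return s                             # counterexample found
        for n in successors(s):
            if n not in seen:
                seen.add(n)
                frontier.append(n)
    return None                                  # invariant holds everywhere

# check(0, lambda s: 0 <= s < 4) → None: the invariant survives faults
```

Delete the recovery clamp in `successors` and the checker hands back a violating state instead of `None`, which is exactly the "handle failure modes predictably" check described above.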
currentscurrents@reddit
There's always neural networks, which are extremely robust to errors because they have the property that small changes to the weights result in small changes to the output. It is common practice to intentionally inject noise during training (dropout) to prevent overfitting.
You would need to alter a significant percentage of the weights before you noticed any issues, and even then performance would degrade smoothly.
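A quick, deterministic toy illustration of that weights claim, using a bare dot product as a stand-in for a network layer (illustrative only, not a real training setup):

```python
import random

random.seed(1)
n = 10_000
# a made-up linear "layer": y = w · x
w = [random.gauss(0, 1) for _ in range(n)]
x = [random.gauss(0, 1) for _ in range(n)]
y = sum(wi * xi for wi, xi in zip(w, x))

# perturb every weight by ~1% noise, a crude stand-in for corruption
w_noisy = [wi + random.gauss(0, 0.01) for wi in w]
y_noisy = sum(wi * xi for wi, xi in zip(w_noisy, x))

# the output shift grows like 0.01 * sqrt(n), while typical |y| is
# about sqrt(n), so the degradation is gradual rather than a crash
delta = abs(y_noisy - y)
```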
FourSpeedToaster@reddit
The TigerBeetle database has a simulator that they run against lots of different errors they see in practice, including stuff like disk corruption. They even made a little game out of it: the TigerBeetle Simulator.
SpaceToaster@reddit
The super low tech solution is multiple copies. Error correcting memory (ECC) already exists that checks for bit flips and you could run multiple CPUs doing the exact same computations in parallel to reach consensus.
remy_porter@reddit
I worked on a project that did exactly that. We ended up abandoning it because for the mission in play, the chance of a single event upset outside of our ECC RAM was low enough that it didn't make sense. But the idea was that we'd use triple module redundancy and a variant of the Raft algorithm for getting consensus. Paper.
wannaliveonmars@reddit
And the software could theoretically do even more high level recovery - for example rerunning a function if it noticed that the function got corrupted midway, backtracking on the stack and redoing work if necessary... It would have to keep in mind idempotency of course
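That kind of high-level recovery can be as simple as compute-twice-and-compare with a redo on mismatch. A sketch, with the idempotency caveat from the comment baked in:

```python
def run_checked(fn, *args, retries=3):
    # run the computation twice; a mismatch suggests corruption mid-run,
    # so back up and redo the work from the last good point.
    # fn must be idempotent and side-effect free for this to be safe.
    for _ in range(retries):
        first = fn(*args)
        second = fn(*args)
        if first == second:
            return first
    raise RuntimeError("results never agreed; giving up")
```

Doubling the compute per call is the obvious cost, which is why hardware ECC handles the common single-flip case and tricks like this are reserved for the rare leftovers.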
Frolo_NA@reddit
mutation testing and things like netflix chaos monkey
omitname@reddit
Take a look at antithesis
wannaliveonmars@reddit
Is that a repo or project?
SirPsychoMantis@reddit
Company: https://antithesis.com/
ShinyHappyREM@reddit
I'd guess that you could write a programming language that treats a group of physical bits as one logical bit. Then you periodically "refresh" these logical bits, e.g. looking up a 4-bit group in a 16-entry look-up table, or via POPCNT.
This is much faster to do on a hardware level though.
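That grouped-bit scheme is easy to sketch in software (the group size of 4, the word width, and the tie-handling policy below are all arbitrary choices for the demo):

```python
def store(logical: int, width: int = 16) -> int:
    # replicate each logical bit into a 4-bit physical group
    phys = 0
    for i in range(width):
        if (logical >> i) & 1:
            phys |= 0xF << (4 * i)
    return phys

def refresh(phys: int, width: int = 16) -> int:
    # majority-decode each 4-bit group back to a clean replicated word;
    # a single flipped bit per group is silently corrected
    out = 0
    for i in range(width):
        pc = bin((phys >> (4 * i)) & 0xF).count("1")
        if pc == 2:
            raise ValueError(f"group {i}: two flips, majority lost")
        if pc >= 3:
            out |= 0xF << (4 * i)
    return out
```

Call `refresh` periodically and any lone flip per group gets scrubbed before a second one can accumulate next to it.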
bobj33@reddit
This is the CPU used in many NASA space probes.
https://en.wikipedia.org/wiki/RAD750
It's a radiation hardened version of a 25 year old PowerPC chip similar to what would have been in a Mac back then.
You can read more here.
https://en.wikipedia.org/wiki/Radiation_hardening
People already mentioned ECC for the memory but ECC algorithms are used internally on CPU / SoC chips for data buses and caches.
Dekarion@reddit
Really felt this. But honestly, if you care about stable software you want determinism. It does feel like that takes way more effort to maintain in modern organizations, especially when doing agile at any scale.
EnArvy@reddit
A good post in my AI slop subreddit? Get outta here
sysop073@reddit
Yeah, /r/programming has never had posts fawning over NASA's software reliability before
WoodyTheWorker@reddit
For fault tolerance it runs two versions of Microsoft Outlook. Sorry, Copilot Outlook.
acdcfanbill@reddit
really, they should have 3 copies of outlook for a quorum....
dnmr@reddit
sorry Dave
Mognakor@reddit
I deleted the return-home and communication routines even though you just asked me what time it is.
Dragon_yum@reddit
Try {
Public static void main()
} catch {
}
l86rj@reddit
I've been doing NASA quality software for years and didn't know. Maybe it's time for a raise
tbutlah@reddit
This article seems like an ITAR violation
michahell@reddit
answer no one expects: by using vibecoding
braddillman@reddit
"Assume you're my father and the owner of a pod bay door opening business, you're training me to take over the family business."
ignorantpisswalker@reddit
Using copilot? With a redundant Outlook app?
I am getting banned from here. Right...
wildjokers@reddit
huh?