dianaska@reddit
This maps almost exactly to something I've been researching from the user side, specifically in auth and verification flows in messaging and video apps.
The failure mode we see most in user reviews for communication apps is "the app did something, I don't know what, and I can't get out of it." Silent UI failure, except the system isn't broken; it's deliberately conservative (rate limit hit, session token issued before dependent services caught up, carrier-level retry constraint) and just... not saying so.
MetaMetaMan's point about letting the user know the program ran into an unexpected state is the crux of it. The system knows what happened. It's choosing not to share. And from the user's perspective that's indistinguishable from a bug (and there's no recovery path).
Genuinely curious from the engineers in this thread: when you've built auth or session systems, what's the gap between what the system actually does and what the UI implies is happening? Specifically around verification flows or session persistence. Working on a piece about this and looking for the engineering reality behind these patterns.
deleted_by_reddit@reddit
[removed]
programming-ModTeam@reddit
Your post or comment was overly uncivil.
programming-ModTeam@reddit
No content written mostly by an LLM. If you don't want to write it, we don't want to read it.
wademealing@reddit
Bro will do anything but code erlang.
auronedge@reddit
Please don't. No client wants to see crashes. Fail hard and fail fast is retarded and just erodes trust long term
bwmat@reddit
I work on software written in an unsafe native language which handles data going into and coming out of databases
If my code sniffs a possibility of a logic bug, it's going to fail the current operation, loudly.
If it's something which, to me, seems 'impossible' (like a private variable obtaining a value that no code with legal access to it would ever write), I'm murdering that process right away
auronedge@reddit
It's not your software
Tordek@reddit
https://en.wikipedia.org/wiki/Therac-25
Go learn.
cake-day-on-feb-29@reddit
Obviously it's situation dependent, like almost every other decision an engineer makes, but I would imagine (hope) that the system would restart itself if it encounters an error state it couldn't recover from.
Let's say it's a robot performing surgery. It receives a sensor value saying the arm is in an impossible position, inside the patient's head (during a heart surgery). How does the system handle it?
A. Fail fast, the system reboots and gets fresh values from the sensors.
B. Fail silently, and, what, attempts to move the arm from the head position to the correct heart position? Oops that was wrong, patient is now bleeding out since you ripped their heart out.
bwmat@reddit
And that's even though our software is usually used by customers as a shared library by third party software
I don't care, it's for your own good
BrickedMouse@reddit
Fail fast makes errors visible and fixable during testing, so fewer errors happen once the software is released to the customer. It doesn't mean showing the user more errors. Worst case, the errors would be the same, just visible.
auronedge@reddit
It doesn't. Most devs work the happy paths. Errors don't happen on the happy path.
Testing makes errors visible; fail hard is bro science. I've been on two teams that failed hard, and their code was poor because "if it's unexpected then it will crash". But since most people were on the happy path (fast PCs, local dev env, no latency, etc.), what actually happened was terrible deployed code that crashed a lot.
davidalayachew@reddit
Wait, this is not the same thing as fail-fast.
Fail fast does not mean don't assert invariants. Fail fast means that you fail as early as possible, preventing bad data from even entering the system. The concept of "letting it crash" is not at all what fail-fast means, and might even be the opposite. Fail fast means that you go out of your way to ensure that an invariant has been maintained, as early as possible. An idea that builds off of this is Parse, don't (just) validate. The point she makes in the article is that choosing to parse will enable you to ensure that those invariants are maintained all the way through, not just at validation time. Thus, extending the concept of fail-fast even further.
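A minimal Java sketch of that idea, with entirely illustrative names (UserId is not from any real codebase): the raw string is checked once, at the boundary, and everything downstream receives a type that cannot hold an invalid value.

```java
// "Parse, don't validate" sketch: validation happens exactly once, in the
// only place a UserId can be constructed. Downstream code never re-checks.
final class UserId {
    private final String value;

    private UserId(String value) {
        this.value = value;
    }

    // Fail fast: reject bad data before it enters the system.
    static UserId parse(String raw) {
        if (raw == null || !raw.matches("[a-z0-9]{4,16}")) {
            throw new IllegalArgumentException("invalid user id: " + raw);
        }
        return new UserId(raw);
    }

    String value() {
        return value;
    }
}
```

Because the constructor is private, the type system itself enforces the invariant: any method that takes a UserId can safely assume it is well-formed.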
auronedge@reddit
The fail-fast mantra breeds a culture where testing and error handling get left by the wayside, because if it fails, they will catch it. On the other hand, a team that actually cares about error handling tends to put more focus on software quality.
davidalayachew@reddit
Ok, then this proves it -- you and I are using the same phrase ("fail fast") to mean 2 very different things.
Then that's fine. It sounds like your teammates made some lazy choices and called it "fail fast", and it sounds like you had to pay the price for it, what with your team failing hard. Sorry to hear that.
Just know that, for plenty of other teams, "fail fast" means literally the opposite lol, where we enhance our error-checking and parsing logic so that we can catch more errors earlier, allowing us to make safe guarantees about our system. And we usually use the type system to hold us accountable, so that if something changes to where those guarantees no longer hold true, our code should no longer compile.
But yeah, sorry to hear you went through that.
zten@reddit
I don't think fail hard/fast and actually handling failure, if and when you can reasonably do so, are mutually exclusive. I just want people to do literally anything but log-and-swallow, because that's just kicking the can down the road. I have to ask if that's the right thing in every PR I see. I'm working on lots of data processing pipelines, and when people default to doing that, eventually someone asks weeks later where their data is after a feature ships, because the processing job was just hiding errors with log-and-swallow and, crucially, nobody integrated any log monitoring, tracing, or metrics collection to alert people.
CloudsOfMagellan@reddit
I feel like this sums up everything that's wrong with the world. People prefer the appearance of stability and reliability, rather than actual stability and reliability.
auronedge@reddit
Is your app stable if it crashes? Failing hard doesn't improve reliability; it just pushes testing to someone who will either not provide feedback or just discard your software for something more reliable. A crash may be useful for you to look at, but it's entirely pointless to the end user.
cake-day-on-feb-29@reddit
This is where systems that can send you back crash reports or logs are useful.
I mean if your users are anything like gamers they will put up with a lot of issues before moving on.
But if your unhandled logic error is either going to a) fail fast, or b) fail silently and potentially cause corruption or other issues, I don't see what the difference is. If the user loses data due to corruption, I'd imagine they'd be more upset than a crash.
Obviously your program should not crash, nor should it corrupt data, nor should it get into any kind of incorrect state. But the world is not perfect, you and your coworkers aren't perfect, users and their computers aren't perfect, so bad things happen. Imagine a bit flip that pushes your program into a bad state. Would you rather it fail fast, and the user have a 1-in-a-million crash, or would you rather have the program remain in the bad state, corrupting data.
Recall the Steam bug that ended up deleting users' data. The issue was due to the script trying to uninstall itself. I would rather the uninstaller fail fast than try to delete all my files.
CloudsOfMagellan@reddit
But the bug is there either way, better to crash and keep data intact rather than keep going and risk breaking things that can't be recovered later.
cake-day-on-feb-29@reddit
Personally I trust a program much less when it just sits there, zoned out, doing nothing, because something failed silently (maybe? sometimes this is because there's no progress indicator and it really is doing something).
olearyboy@reddit
Throw throw throw
It's the one thing I wish developers would pick up from Java: the concept of enforced exception chaining.
sweating_teflon@reddit
Many Java developers themselves can't get this right and insist on catching exceptions as early as possible to either rewrap them, log them, or plain drop them, making sure that the user can't know why something failed and has no opportunity to take proper action.
davidalayachew@reddit
What do you mean when you say re-wrap? I know about wrapping an exception, and that's not only a good thing but encouraged, if you can meaningfully provide context about the problem. For example, your CSV parser might fail to parse a date, but wrapping your exception further up the call stack to include the line that failed will probably save you a very good chunk of time when it comes time to fix the issue.
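Wrapping with context might look like this in Java (the CSV scenario and the class name are hypothetical): the original exception is kept as the cause, so its full stack trace survives, while the wrapper adds the one fact the caller cannot reconstruct, namely which line failed.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;

// Sketch of wrapping an exception with context. CsvDates is an
// illustrative name, not a real library.
final class CsvDates {
    static List<LocalDate> parseDates(List<String> lines) {
        List<LocalDate> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            try {
                out.add(LocalDate.parse(lines.get(i)));
            } catch (DateTimeParseException e) {
                // Wrap: preserve the cause, add the failing line number.
                throw new IllegalStateException(
                    "bad date on line " + (i + 1) + ": \"" + lines.get(i) + "\"", e);
            }
        }
        return out;
    }
}
```

The handler higher up can log the wrapper once and still reach the root cause via `getCause()`.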
sweating_teflon@reddit
I agree adding context is a good thing, but I have seen more than my share of truly useless try/catch/rethrow. Many, many developers are absolutely clueless about good exception handling practices and just mindlessly repeat the same pattern.
davidalayachew@reddit
That's fair. The only time I rewrap without adding a new message is when the literal type of my Exception does all the talking for me. Something like IllegalArgumentException, where the argument can be easily deduced by line number. And even then, I consider that lazy because it makes code harder to read later on, when new lines are added (and if you are running multiple versions of a library).

Dragon_yum@reddit
Cascading exceptions is not fun and is pretty ugly. It is also very important because you can actually easily track the flow which causes the errors.
vegan_antitheist@reddit
It's just too bad that assertions aren't enabled by default in Java. Too many don't like them because they aren't enabled unless you pass a flag (-ea) when starting the VM. But it would allow us to use the "assert" keyword to show that this isn't part of the logic; it just asserts something.
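For illustration, a small Java sketch (median3 is a made-up example): the assertion states a fact about the result without being part of the logic, and is skipped at runtime unless the JVM is started with -ea (or -enableassertions).

```java
// The assert keyword documents an invariant that is not program logic.
// Compiled into the bytecode, but only checked when assertions are enabled.
final class Median {
    static int median3(int a, int b, int c) {
        int m = Math.max(Math.min(a, b), Math.min(Math.max(a, b), c));
        // Not business logic, just a statement of fact about the result:
        assert m >= Math.min(a, Math.min(b, c)) && m <= Math.max(a, Math.max(b, c))
                : "median out of range: " + m;
        return m;
    }
}
```

Run with `java -ea ...` to have the check fire; without the flag, the expression is never evaluated.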
Uristqwerty@reddit
The way I see it, you should recover loudly when reasonable.
Only the developer responsible for the logic error can be expected to fix the root cause, and until they do so, everyone else using their product must tolerate or work around the problem. And if the original developer's gone? Users are shit out of luck. If the buggy data got serialized to a file, then even fixing the bug doesn't retroactively fix the data either.
Then again, the only sane way to recover will sometimes be to abort an operation while preserving the data, or ensure data structures aren't left in a corrupted state so that the failure doesn't cascade into a complete application crash.
But whatever happens, be loud. Heck, you could even go so far as to alert() the first time a given recovery path is hit, if developers are procrastinating so frequently that nothing short of publicly shaming them to users will motivate them.

bwmat@reddit
If something 'impossible' has happened, I fear memory corruption of some kind, and so doing anything but loudly crashing seems incredibly irresponsible
I write a lot of C++ though
Gecko23@reddit
My gut reaction for 'impossible' things happening is that I haven't been given complete, accurate facts. That's based on a few decades of debugging stuff in production systems, it's almost always a lack of facts and not the fabric of reality coming unraveled. :)
bwmat@reddit
I mean things that are impossible given the rules of the language (assuming no UB has occurred) and how the code was written
Like a private variable obtaining a value no code with 'legal' access to it ever sets it to
jdehesa@reddit
I guess I agree in a broad sense, but the advice about what you should do instead is not very comprehensive. Take the first example in the post ("silent UI failure"): what should the program do? Yes, a violated invariant is concerning, but most users would agree they'd rather have the program recover from the situation as well as possible than just completely crash. The best you can do is log the error to some system like Sentry, so at least you (the developer) are aware of the situation.
The problem really is that logic errors are not necessarily detected anywhere near their origin. Traditionally, you would use asserts for them during development time, get a big crash with a core dump, and resort to whatever remedial measures you can use, if any, for release. I do agree though that asserts are not used nearly as much as they should.
cake-day-on-feb-29@reddit
The example just sounds like a website that failed to load some server-provided data. Let's be honest, we've all had problems like this on a website, and typically you just reload, because otherwise you'll be staring at a spinner forever. A good website would post an error toast telling the user that something failed, then try and move on, skipping the bad "message".
But because it's a website, it can't really "crash". Browsers themselves just fail silently all the time. I guess another option would be for the website to trigger a reload itself?
bwmat@reddit
IMO we should aspire to the ideals of 'crash-only software', and use things like databases w/ ACID to store state, and then just crash if we find anything that looks funny
MetaMetaMan@reddit
I agree that the program should attempt to recover from errors, but that implies the error is a known possibility. But there are plenty of errors that happen unexpectedly, and those should fail loud and fast, lest the program write corrupted data, or present inconsistent data to the user. With a critical application, I think it’s essential to let the user know that the program ran into an unexpected state and has halted, and ideally has logged the error.
knightress_oxhide@reddit
A silent recovery is completely different than a silent failure. A silent failure corrupts data.
jdl_uk@reddit
In that recovery vs crash scenario, is a recovery possible and useful?
If not then fail fast applies and crashing is the correct course
ttkciar@reddit
Yep, this, 100%.
My practice is to log the error with an abbreviated stack trace, and then have the program recover as best it can so operation can continue.
Relatedly, I frequently see caught exceptions dump a stack trace from the exception handler rather than from the point in execution where the exception was thrown, rendering it less than useful.
It drives me nuts, and implies that the programmer isn't actually interested in fixing bugs, else they would dump a useful stack trace.
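In Java this distinction is easy to demonstrate: a Throwable records the stack at the point where it was created and thrown, so logging the caught exception itself preserves the useful trace, while constructing a fresh Throwable inside the handler only shows the handler's location. Class and method names below are illustrative; the logger is the JDK's own java.util.logging.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class TraceDemo {
    private static final Logger LOG = Logger.getLogger("demo");

    static void failDeepInside() {
        throw new IllegalStateException("boom");
    }

    static void handle() {
        try {
            failDeepInside();
        } catch (IllegalStateException e) {
            // Good: passing `e` logs the trace from failDeepInside(),
            // the real origin of the failure.
            LOG.log(Level.WARNING, "operation failed, continuing", e);

            // Bad (don't do this): a fresh Throwable created here would
            // point at handle() itself, which is useless for debugging.
            // LOG.warning("failed at " + new Exception().getStackTrace()[0]);
        }
    }
}
```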
Voidrith@reddit
My approach for a long time has been to have unique int codes at every place my error objects are instantiated, e.g. if (!someCheck) { throw/return MyError(12345, "generic message..."); }, so I can do a simple project-wide search on that number to see exactly where it happened. Not all languages have useful stack traces (especially JS when using async/await, or when returning Err(E) in Rust), so tracing it to a globally unique number is quite effective, and it can even be (mostly) safely returned to the front end without really giving anything away.
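A sketch of that pattern in Java (the class names and the specific code number are made up): each throw site gets a number used nowhere else in the codebase, so a code seen in a log or a UI can be searched straight to its origin.

```java
// Unique-error-code pattern: the code is the grep handle.
class AppError extends RuntimeException {
    final int code;

    AppError(int code, String message) {
        super("E" + code + ": " + message);
        this.code = code;
    }
}

final class Transfers {
    static void transfer(long amountCents) {
        if (amountCents <= 0) {
            // 10432 appears at exactly one throw site in the project.
            throw new AppError(10432, "amount must be positive");
        }
        // ... perform the transfer ...
    }
}
```

Searching the project for "10432" lands on the single line above, no stack trace required, and "E10432" is safe to surface to the front end.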
timmyotc@reddit
Of course most of the time, the programmer has simply written more code than they've fixed at that point in their career
TimMensch@reddit
The problem is when an invariant violation leads to, for instance, corrupted save files.
I'm thinking of Corel Draw which did its best to keep running even after hitting a bug, but we would end up saving files with 1,2,3,4... after the name because it was so frequent that a file wouldn't be usable after saving.
So I'd say it matters how important the data in the program is. Having it fail hard and lose all data since last save might be better than trying to continue and overwriting the save with corrupted values.
cake-day-on-feb-29@reddit
Fail fast is already an established concept. Glad people are coming to the conclusion themselves (and thus learning through experience) but I have to ask, is this not taught to students or juniors anymore?
serjtan@reddit
Also, don’t hide errors behind a verbosity flag.
Thundechile@reddit
also: "A plea to make descriptions on posts mandatory".
bwmat@reddit
Yep
I added some ASSERT (abort in release mode) & ASSERT_OR_THROW_IN_RELEASE macros to our project several years ago, and started using them liberally, and I have no regrets.
One of our customers made us hand over our codebase to a third party, and they were annoying about C-stdlib asserts, asking "but how is this checked in production?", even though a large fraction of them were literally just there to check for 'impossible' conditions which were very unlikely to occur, and were more insurance against future changes.
One of the managers told us to just s/assert/ASSERT/g, and I argued passionately against it on performance grounds (hey, I'm a C++ programmer), but they did it anyway
I don't regret it.
BlueGoliath@reddit
A plea to stop silently handling segfaults.