What are some unforeseen / elusive edge cases you have seen in your career?
Posted by gobuildit@reddit | ExperiencedDevs | View on Reddit | 151 comments
Hello fellow devs
I would love to read some stories of insidious edge/corner cases that you encountered in the wild while building software. How did you encounter it, and what lesson could you share with the community?
TylerDurdenFan@reddit
Back in the mid oughts I was building a system that collected files for data processing. The files were call detail records created by already old telephone exchanges. Some file repositories were accessed via a primitive FTP, so tricks had to be used like "list the file size 5 minutes apart to make sure the file is done being written to".
A month after go-live, and while still on hypercare, the file collection module wasn't working (which meant billing wasn't working). I spent the whole day chasing down the issue, which was in the Apache-licensed FTP library we were using: it failed when file listings contained the date for that very day, February 29.
I reported the bug upstream, but for our system I just wrote a very quick and dirty fix that ensured it processed that day's files before the day was over.
I didn't stay long enough there to find out what happened 4 years later, but I know the system is still in production today, so it probably worked out well
rish_p@reddit
newsletters cron job not sending out emails because of daylight savings time skipping the actual send out time
BlazingThunder30@reddit
That's a bad Cron engine. Don't most account for that nowadays?
opinionsOnPears@reddit
I feel like that would depend on the time zone you set? It wouldn’t happen if set to UTC?
reddit_man_6969@reddit
GitHub Actions does not 😕
DWALLA44@reddit
Not quite the same thing as the question but i love telling this one. We had a bug where when we opened a multi-step side panel in angular, and then went to the second step, the whole page would force scroll and we'd lose the navigation and there was a big white bar on the bottom of the page until a refresh, visible part of the page still worked though.
The fix was pretty simple, we had some scrollTo behavior that just didn't work because it couldn't find a container to attach to until it found the main app container, but the intriguing part was that the code hadn't changed in YEARS.
So understanding that took a few hours and a lot of git bisect. The bug came about because of some accessibility work that added 4 PIXELS of height to a couple buttons in our footer, which was enough to push the page below the fold, for the rogue scroll behavior to actually scroll.
BoringAsparagus701@reddit
Don’t use memcmp on a struct with padding, especially if that struct is used as the key in a std::map. You’ll end up with non-deterministic order in the map.
What makes this even more fun… the bug goes away in debug mode because the compiler plays it safe in debug builds and zeros out the padding. A “heisen-bug”. The moment you look for it, it vanishes.
demosthenesss@reddit
After several months of dev investigation, we learned the reason our graphics were showing intermittent issues on ARM hardware only (via Qt) was some weird glitch related to a USB mouse being connected.
This mouse was only being used for testing purposes, so the entire bug was never a real issue outside testing.
VoodooMonkiez@reddit
Why does this sound familiar to me…. I swear I have heard about this bug before…
dash_bro@reddit
Most interesting one I'd seen recently was a latency spike on something that was just rendering markdown and sending it as a stream.
It'd sometimes take double the time and I accidentally caught it while working on an unrelated part of the codebase, just simulating it locally.
Turns out, the error thrown was not caught and handled; instead it propagated all the way out of scope into a broader context and just hit a retry -- and only the retry was logged, not the markdown render failure.
Fairly simple fix once we found it, but it's such an odd, non-repeatable case that it had been present for 2+ years until I chanced upon it, while serving 200k+ DAUs.
hobbycollector@reddit
Hardware error. Under certain circumstances, rather than a straight jump to the location indicated in code, it would fetch the next statement after the jump and run it first. But not on every machine made by that manufacturer.
gobuildit@reddit (OP)
Branch prediction gone wrong?
hobbycollector@reddit
I assume so. This was in the 80s, and we ended up having to get an in-circuit emulator to prove to ourselves it was neither our error nor a compiler error. We sent a small program that exhibited the issue to HP and they were able to reproduce it. Sometimes.
St34thdr1v3R@reddit
How did you debug this? Sounds hard to debug 😅
selekt86@reddit
Segmentation Fault
PsychologicalCell928@reddit
Wrote a Monte Carlo simulation for interest rates and FX rates. It generally performed well but there was something off in the results. During testing two portfolios containing the exact opposite positions should have netted to zero. But that wasn't happening. We kept getting a small residual exposure.
Kept paring down the portfolios, then pared down the number of simulations, added LOTS of tracing data.
After weeks I isolated it to a bug in the DEC double precision floating point library.
In certain circumstances adding a very small positive number to a very small negative number of equal magnitude did not yield zero. (1e-09) + (-1e-09) would leave a small positive result.
Isolated the bug, created test cases to verify the problem, and we submitted them to DEC.
They were able to recreate the problem & issued a software fix to handle what was a hardware defect.
_________
In order to make progress in our development while waiting for DEC to identify and fix the problem we defined a variable EPSILON as 1e-08 and changed our code from
if ((a + b) == 0) to if ((a + b) < EPSILON)
GravelWarlock@reddit
Isn't using a floating point for currency a bad idea, exactly for this reason?
SHITSTAINED_CUM_SOCK@reddit
Yeah I'm confused by this. Dealing with high-precision data I've always used doubles or treated data as two ints (2.3 is 2 and 3). Floats feel like a terrible idea, but I'll also admit I don't work in high-volume high-precision data so I'm sure there are things I'm missing... Surely.
await_yesterday@reddit
It's a Monte Carlo simulation; normal to use floats here, even in a financial context. It's not doing cent-for-cent accurate accounting, more like "we predict we should allocate 2.92% of our portfolio to XYZ in Q3 20XX".
And on top of that, the usual reasons for avoiding floats don't apply here. This was a hardware bug, not the documented weirdness of IEEE 754.
SHITSTAINED_CUM_SOCK@reddit
That makes sense. Thank you for the explanation.
k_dubious@reddit
I work in fintech and we always represent monetary amounts internally as integer cents and then convert to decimals only for displaying to the user.
SirDale@reddit
Fixed point or decimal is better.
ShoePillow@reddit
I think it is quite common to have float/double utility functions in the code for comparisons, isn't it? Or have I just seen some really old code that needed to do this in the old times?
lordnacho666@reddit
Very early on in my career I was debugging VBA (I've since recovered, thanks).
There was a script that simply died on a certain line. No error message, who would need that, right? It just... stopped after that line, no more execution.
I spent the whole day trying to nail it down, but could find nothing. In the end, I retyped the line and it worked. Looking at the saved version, I had replaced the letter "l" (L) with "1" in one of the variables.
Super productive day.
Crafty-Pool7864@reddit
I had a variant of this while working on a text editor. We output invisible characters to handle certain edge cases. We spent a long time verifying that doing so couldn’t hurt us.
No one anticipated that someone (me) would accidentally copy/paste one of them into our source code.
That’s a day of my life I’m not getting back.
riffraff@reddit
Been there, I had accidentally typed some weird invisible character, retyped, it worked, and I could see the unseen when diffing.
iammobius1@reddit
Fonts matter. I've had an iteration of this bug: there are two kinds of quotation marks, the regular straight ones and the curly left/right ones. That was a fun regex to debug.
NeitherEchidna3491@reddit
I've had a permutation of this with rich/non-standard chars being copied into strings that were pasted into the IDE but didn't break anything until runtime. Along similar lines I once had an issue in Python because of content in a block comment, turns out those are also "code" and not something that just gets filtered out at build time.
Bugs like this make me yearn for a day when I can retire to the mountains and raise geese.
Bright_Start_9224@reddit
That's my kind of mistake.
ShoePillow@reddit
You were a 1337 coder
JandersOf86@reddit
I feel that, hella. Have spent hours on a personal coding project not understanding wtf was happening and it turned out to be a misspelling in a string that did not cause compilation errors. I think I just kept passing over it subconsciously.
RandomIndecisiveUser@reddit
Not one but two separate instances of overflow. One was an int overflow when converting a time duration to milliseconds - it always zeroed out after approx 25 days. One was an unsigned long in a proto field which was being set with the default, -1.
Significant-Duty-744@reddit
I was working on building Playwright tests for some of our websites. These websites use feature flags to determine what pages/features should be available to users. Naturally we wanted our Playwright tests to check these flags prior to running a test, which involved installing a Node SDK for the feature flag library. So we get it integrated and deploy it out aaaaand it fails the test run because it can’t find any tests. Mind you, these tests worked prior to adding the integration. So it works locally but not deployed.

Well, we spend some time investigating and decide that it must be an issue with our proxy. So we spend hours trying different things and talking to our networking guys, and we finally figure out that our proxy is denying the traffic to the feature flag SaaS because of bad credentials. So now we spend time trying to figure out if our credentials are being injected correctly into our Docker image, and everything looks fine, perfect even. It’s inexplicable why this doesn’t work.

Finally I decide to pull up this SaaS’s source code for their Node SDK on GitHub and walk through the entry point I’m using to where the proxy communication occurs, and I find something odd: the auth header for the proxy request is built with a JavaScript template string (`${myVar}`) - except it has an extra trailing “}”, so my proxy credentials would always be added to the headers of the request with a trailing “}”, making them invalid. Hours and hours of questioning my sanity, and it turns out it was a defect in a library for a product we pay for…
stoopwafflestomper@reddit
After weeks of outages to core systems following a deploy, it was finally discovered that when the Redis cache triggered its failover to fetch from the DB instead of the cache, it produced an SMTP error. So when a key was evicted or expired, hundreds of clients requesting that key fell back to DB calls and produced hundreds of SMTP error emails. That triggered Microsoft's sending/rate limits. AND that triggered the CPU to spike to 100% while it worked through sending all those emails. Took about 45 mins if you didn't kill the jobs.
matthra@reddit
We had an internal product that was vulnerable to sql injection, we found it, spent a few days panicking, and got it fixed. The next day a department called us saying their tools no longer worked and asked us what had changed. We were confused because the sql injection fix was the only thing we had rolled out in a week or so, so after a lot of troubleshooting we put a trace on and monitored what their app was sending. It turns out that they were using the sql injection vulnerability to make an end run around all of our processes and controls. The following conversations were pretty interesting, with a VP asking us if we could put it back.
gobuildit@reddit (OP)
SQL injection is a feature not a bug 😀
ShoePillow@reddit
If you're lucky enough to have users, any bug is a feature for someone
backFromTheBed@reddit
Hyrum’s Law
untetheredocelot@reddit
The number of V2 apis with the same contract as V1 in my company is a testament to this. We can’t force clients to migrate away from old possibly buggy behaviour.
AtomicScience@reddit
XKCD 1172
space-to-bakersfield@reddit
Reminds me of working on a team way back in the day when I was a junior that published campaigns to our website through a CMS. Sometimes we'd get requests for things that just weren't possible with the tools in the CMS, so we had to make code changes and push a new release to support. That process was painfully long though. Everything had to go through rounds of review by QA, the DB team, and another group that handled load testing, the whole deal. Then our manager had to go in front of a committee that only met on Tuesdays to get approval. It dragged on forever and made timelines way too long for the fast moving marketing campaigns we were supposed to be supporting. Meanwhile our internal clients kept asking for more. They wanted more interactivity, custom forms, all that kind of stuff, and the CMS just couldn’t keep up.
So in one release we cracked open a little hatch that let us publish raw JS through the CMS...oopsie! Our department then got a third party host tied to a subdomain, set up a LAMP stack, and we started dropping fully built sites there and pulling them into CMS pages through that JS field. Problem solved lol
AJB46@reddit
The first paragraph sounds like the release process for health insurance.
FlipperBumperKickout@reddit
This is just insane 😅
AfraidOfArguing@reddit
I saw a bug in a previous jobs react app where a div wouldn't properly overflow with scrollbars if you had MacOS scrollbars as transient instead of always available
gobuildit@reddit (OP)
Across all the browsers?
AfraidOfArguing@reddit
As far as I remember, yes. If you didn't have scrollbars always on, it was 0px tall.
etTuPlutus@reddit
Date picker on a sign up form was accidentally coded to use some weird octal conversion for the month. I don't remember why, I think it was something like a C# programmer writing JS and not realizing JS would autoparse leading zero values into an octal.
For a week after launch, we had this lady trying to sign up that kept getting an error entering their birthdate, but we couldn't reproduce it. The ticket never mentioned what date they were using. So, we tried every date edge case we could think of. Feb 29th, Feb 29th on a non leap year, Jan 1, Dec 31, etc. Nothing gave an error.
We finally got the customer on a screenshare. They fucking picked September and immediately errored. 08 and 09 are the only months that are invalid octal values in JS. (Technically 010, 011, and 012 were storing wrong values, but because they were valid octals it didn't break the form). It never occurred to us to test edge cases for other number systems.
These days, I'd probably just write some Playwright code to test every single date and call it a day.
seven_seacat@reddit
oh I've seen this exact same issue! Freaky
User_Deprecated@reddit
Some counterparties send FIX body tags out of order compared to the spec. Checksum still passes since it's just summing bytes up to tag 10.
If your parser is strict about tag ordering, you just end up dropping messages that are otherwise fine.
Spent way too long debugging why one broker's fills were disappearing while everything else looked fine. Turns out the spec and what actually gets sent in the wild are not the same thing.
PsychologicalCell928@reddit
This might qualify as unforeseen by the designer but definitely foreseen by me.
A customer did a project to build a distributed FX trading system. The system itself ran on a local Novell network with a dedicated machine supporting an X.25 connection to the outside world. In addition there was another dedicated machine that acted as the local database for static information like the names, addresses, and network addresses of the different counterparties.
However the vendor decided not to use a 3rd party database like ORACLE but wrote their own storage system based on linked lists.
In reviewing the design I called out that the database was unlikely to support the type and volume of queries & that they should consider getting a 3rd party database rather than writing their own.
I was soundly voted down, and quite rudely told to "stay in my lane", which was production support and training. (It was a contract gig I'd taken to fill some dead time.)
The vendor delivered the system and everybody was congratulating themselves after a successful demo. I noted that the vendor only ever picked from the first ten counterparties ( all starting with "A" or "B".). I asked the presenter to pull up the counterparty list & typed in 'Zaire Bank' which I knew was one of the banks supported. The system ran for a minute, then five minutes, then 15 minutes; all the time flashing a little light when the next record was retrieved.
There were about 300 counterparties in the database and each record required 5-10 seconds. So to get to the Zaire Bank was going to take somewhere between 1500 and 3000 seconds ( i.e. 25 and 50 minutes ).
There were other instances of retrieving data that demonstrated the same problem - for example, asking for a trade history that was greater than a few days - which was a pretty normal request.
By the time the customer and the vendor agreed on a redesign and reimplementation the market opportunity was gone.
gobuildit@reddit (OP)
Sounds wild to me that they decided to roll out a custom storage for such an application.
PsychologicalCell928@reddit
The company that was hired had developed a nice little product to allow LAN applications to communicate over a WAN. It worked quite well if the problem was wanting to access a file from a satellite site.
The old saying, “when all you have is a hammer then every problem looks like a nail” definitely applied.
The other things they did well were to get paid early in the contract AND push responsibility for design flaws onto the customer.
SoCalChrisW@reddit
I had to deal with a bug once that only showed up on a machine running the Japanese version of Windows with the German language pack installed.
That one sucked to work on.
untetheredocelot@reddit
Man fuck localization.
I had to migrate our internal localization system's backend (huge company, so we have an internal team for translations). The way to do this was to take our existing applications’ string manifest and onboard to a new service which, with the aid of machine translation, re-translated everything.
The fucking issue is I don’t speak 28 languages and I had to verify that nothing changed…
Also around the same time I had to chase down a user locale preference bug that was causing failures in French, German, and a few random other locales due to this legacy migration.
I got dinged for delays in my performance review… I am furious about it.
binarycow@reddit
What was the bug?
A_happy_otter@reddit
My guess is a window didn't properly fit the long German words visually. German words have so many characters, and Japanese Windows expects fewer.
JohnDillermand2@reddit
Triggering.
Had something many years ago where we were consolidating a bunch of old systems, but very basic stuff with a database storing a file name and a directory holding that file. And there are those ten or so characters that are illegal to use in a Windows file name.... except on Japanese Windows. And of course this isn't picked up on until the 11th hour. OK, maybe I just hand-correct a few and move on with my life. No, it was massively systemic, an entire office culture that had been using emojis in file names for years as their own way of searching for things?? (Or at least that's my running theory.) :-/ Just seeing that still triggers me.
johnpeters42@reddit
ENTSCHULDIGUNG! ("Sorry!")
OkDesk4532@reddit
Krautsayonaralat
scalarDE@reddit
Glückwunsch! ("Congratulations!")
KnowingRegurgitator@reddit
Once had a memory leak in our c++ library. It took me a day or so to figure out what variable initially allocated the memory. But that code was like five levels down the stack from the main function that then passed the memory to about 5 other functions with variable levels of nesting. Luckily the original author of the code (which was about 15 years old at the time) was still with the company and was able to help.
Ok-Entertainer-1414@reddit
Frontend code. Some calendar stuff relating to days of the week was using an API to get a string representation of the weekday name, like "Monday", and treating those like enums. But actually, the API returned a localized string in whatever language the user's browser was set to. This made a string comparison always evaluate to false, because the non-English values were being compared to English values. This was in a service where users paid for something scheduled on a particular day of the week. Users browsing in non-English would erroneously see a message like "sorry, there is no availability on any day right now, check back later" and leave without buying anything.
But since this happened in a way where it wasn't obvious to either the users or the company that a bug had happened, this went unnoticed for like a year, and we only found out cause I just by chance was reading the relevant lines of code, and went "wait, wouldn't this silently fail if this returns a localized string?"
thezeno@reddit
A priority inversion problem where a low priority process ended up starving a high priority process for time, resulting a stalled graphics pipeline.
Podgietaru@reddit
I imagine this is not that uncommon, but it taught me a lot very early on.
The first codebase I professionally encountered was a very old Java Swing application that hooked into C++ in order to perform some intensive calculations.
Because it hooked into C++ any problems would cause the application to just terminate. Sigfaults. No stacktrace or anything.
So - I hooked the debugger in, and followed everything through. It was an old application. Which meant a LOT of magic numbers. But the thing was, I couldn't for the life of me get the crash to happen in debug mode.
The problem was reused or uninitialized memory. In debug, the debugger was filling it with debug information, so one magic number check was passing where in production it wasn't.
I spent absolute HOURS on this. I was so so confused by it. All because I wasn't used to dealing with raw memory like that.
NeitherEchidna3491@reddit
Being unable to repro race conditions in a debugger or with additional debug code because it subtly changed the runtime performance/behaviour is rare but has probably happened to me more than once :')
ShoePillow@reddit
Would something like asan or valgrind have caught it, you think?
BlakeIsBlake@reddit
Yeah, I had the same thought... gnarly bugs have taken me months to figure out. Hours is nothing
gobuildit@reddit (OP)
Heisenbug!
melgish@reddit
In the late 90’s there was a bug in the Oracle ODBC driver that would occasionally throw an exception with the message being “Command completed without error.”
gobuildit@reddit (OP)
SuccessfullyCompletedException.
Perfect-Campaign9551@reddit
Far too many bugs in these comments of people using floating point for things it's not meant to be used for. They got what they deserved.
Notary_Reddit@reddit
Did you know that floating point additions are not commutative? I.e. a+b = b+a isn't always true when looking at millions of floating point additions that happened in different locations. Having to explain why we're ignoring "differences" in expenses was a fun conversation.
Perfect-Campaign9551@reddit
Again, why the hell are people using floating point for money? You mentioned expenses.
You get what you deserve lol
evincarofautumn@reddit
Worse, floating-point addition is commutative, except for NaN, but it’s not associative, so ordering still matters, but in a more subtle way. You want to group values of similar magnitude, or keep track of the error like in Kahan’s sum algorithm.
Askee123@reddit
Invisible characters in filenames were a fucking doozy to deal with
user08182019@reddit
Remote API that used eventual consistency but didn’t disclose it. So you give their side something to persist, then ask to read it back, and it would randomly be there or not, but by the time you manually checked it was always there. So we had to build retry with a backoff coefficient. Awful experience.
My lesson is: with “just use X, we don’t want to do NIH / we don’t have the resources to build it ourselves,” you’re still going to build it yourself, but now the app exists in the form of an integration. Depending on the fit and assumptions, that can become more painful than just making the thing, and you often don’t find out until you’re deep into the integration.
csgirl1997@reddit
Our data and caching infrastructure was DDOS’ing itself. Long story short, my team built out an infrastructure tool/set of libraries to handle data retrieval/caching of some very very large, but frequently changing data. There was also this sort of allowlist system that controlled which environments/app instances were allowed to load specific data.
Originally it was intended to be used by only one team, but it quickly caught on and a bunch of other teams adopted it. (Perhaps too soon)
A few months after a bunch of teams went to prod with this tool, all of our production deployments started failing due to database timeouts.
Someone sees a log from my team’s code in the stack trace related to the timeout. We consult with the database team. They’re doubtful our infrastructure tool is putting enough strain on the database to cause this.
I start looking at the logs just in case. I see our system being called from a set of production environments I haven’t seen before.
The log indicated that the system was attempting to reload the data from the database because it had received a request for data, but there was nothing in the cache.
I look at how often that log fires. It’s being logged nearly 100 times a second. I look at the query that’s failing. It’s retrieving the entire dataset in one go.
And then I look at the code and the database. Come to find out:
Therefore, there was never any data in the cache, so every time it was queried, it attempted to reload the entire cache, and then found nothing.. 100 times every single second.
I allowlisted a single data entry to be picked up in that environment. Almost instantly, the 100 database calls per second stopped. And our production instances could start again.
Ambitious-Garbage-73@reddit
The class of edge case that still makes me paranoid is when the data is perfectly valid, just not valid under the assumptions your code quietly made.
One boring example: names or identifiers that look identical to humans but are not the same string because of Unicode normalization. Everything passes in staging because the fixtures are ASCII, search works for 99.9% of users, then someone copies a value from a PDF or from an iPhone keyboard and suddenly dedupe, auth lookup, or a payment/customer match starts acting haunted. Nothing is down, no exception is thrown, the database query is technically correct, and the support ticket reads like the user is doing something impossible.
The lesson I took from that kind of bug is that validation at the boundary is not enough if downstream systems compare values differently. You need to pick a canonical form early, log both the raw and normalized form when it matters, and have tests with ugly real-world strings instead of only nice English examples. It feels excessive until the first time two strings render the same in a dashboard and only one of them can be found by your backend.
gobuildit@reddit (OP)
Thank you for sharing the lessons!
corny_horse@reddit
I was a database administrator early in my career, working in an Oracle database... someone in the 80s wrote some code that basically got compiled to run in these systems. Well... fast forward to 10 years ago, and I couldn't get some data to load, but only in prod, even after a clone. Long story short, it turns out the compiled code had a hard-coded failure for only the date I was running, and only in prod, to test whether some logging software caught the exception in prod. That was a fun bug to figure out...
nana_3@reddit
Working on eftpos machine software. A QA contractor decided to tap his work ID / door card for an unrecognised-card test case (we had provided some unrecognised cards to test with; that wasn't one of them, though). Hardware/firmware flipped out, crashed the program, and nothing would work until a full reboot.
We never did solve that one lol. Decided if the user was trying to pay with their work ID we could rightfully ask them wtf was their problem.
binarycow@reddit
I had a bug that only occurred if a breakpoint was set* in a specific spot.
* And the debugger paused on it
Head-Bureaucrat@reddit
I work with a company that primarily works with regional software (it's for a local utility.) The two biggest ones I've seen are:
A form that took phone numbers in (xxx) xxx-xxxx format. We kept having accounts not respond to certain notifications. After something like a year, someone finally called to complain: they lived in Japan and their vacation house was in the utility's service territory.
When reporting that your power was out, a specific account condition (you had to have more than 2 points of service on the same account, so think a shop with its own dedicated electric service, which is something like 1% or less of the customers) caused the server to churn through many additional records. Not a problem unless several people were reporting, and our code serialized the processing. 🙃 I actually found the issue (only thanks to randomized data in a load test I did), but was assured it was too rare to deal with. Then the CEO, with a very large shop or lake house or something, went to report an outage during a storm and saw the error.
dugopark@reddit
I was code complete and testing my change on a distributed computing application with heterogeneous server types. Getting time on the test system was scheduled by slots. For the life of me I couldn’t figure out why the servers my code was running on would segfault and fail during full system testing.
Unit tests were passing. Code coverage was good, so I knew I wasn’t missing a unit test. I went through it line by line until finally I spotted it. I’d added a debug logging message in case a function parameter was null. Key part though was that I was assigning the variable and performing a null check within the context of the debug logging message that would print something like “this should never be null”
I checked the header file for the debug logging message and punched myself in the face. Debug log messages are ablated in optimized builds, including any code it wraps. Full system tests build optimized binaries. Unit tests do not. The assignment was never happening and code later in the function was accessing uninitialized memory.
I learned an important lesson that day. Sometimes the bug is in your head. Your mental model of how something works could be wrong, so when you’re at your wits end, work through and prove out your assumptions.
Or use Claude or ChatGPT. I don’t care :)
random314@reddit
Around 2006/2007-ish I worked for an airline procurement company. Zimbabwe went through hyperinflation, and our currency conversion library broke because it ran out of digits and was calculating prices wrong.
gobuildit@reddit (OP)
Nasty! What was the real world impact?
new2bay@reddit
It sounds more like a case of the real world impacting the software than vice versa. In 2007, inflation in Zimbabwe was estimated at 662,000%. The following year, it reached 2.315 billion percent. When prices are changing that fast, it doesn’t really matter what you’re charging; it basically equals 0 by the time everything works through the system.
SirDale@reddit
I have a one hundred trillion dollar note from their hyper inflation days.
dudeaciously@reddit
We were processing files for a bank. Then our awesome software was used for a second and third bank. THEN the requirement came in: what if they mix up which folder is meant for which bank? So we had to add a check we had not considered previously.
dbxp@reddit
Years ago we had some reports which had been in production for a while working fine suddenly start corrupting. The figures would just randomly change half way through the file. Then we realised what idiots we'd been, the excel workbook object was static. If you ran the report at the exact same time on two machines and the load balancer picked the same web server then it would merge the two files. It had snuck into a number of reports and wasn't picked up by any form of testing or users until months/years down the line.
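A stripped-down illustration of that failure mode (in Python rather than the original .NET, with hypothetical names): state declared at class level is shared across every instance, i.e. every concurrent request.

```python
class ReportBuggy:
    rows = []  # class-level: one list shared by every instance/request

    def add_row(self, row):
        self.rows.append(row)  # mutates the shared class attribute


class ReportFixed:
    def __init__(self):
        self.rows = []  # instance-level: each request gets its own list

    def add_row(self, row):
        self.rows.append(row)
```

Two concurrent reports built with `ReportBuggy` interleave their rows into one merged file, which is exactly the corruption described above.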
mikeonh@reddit
Ok, here's another "edge case" - the random, middle of the night failure.
In the early 80's I worked for a vision-guided robotics company - some of the jobs were vision only.
One of our Detroit robot customers wanted to verify the accuracy of the assembled body shells our robots helped build, so we designed and built this large tunnel-like framework with lots of lasers pointed at locations on the bodies, and cameras with precise optics looking for where the laser spots hit, in order to precisely measure the tolerances.
The laser and camera modules were in heavy-duty housings - this was an industrial environment.
Sometimes - and only in the middle of the night - the cameras and/or lasers would shift out of alignment by a significant amount.
We tried everything, but couldn't figure it out. Finally, added a couple of extra cameras to watch the framework that the bodies were passing through.
Midnight shift workers were using the framework like a jungle gym, and the laser and camera housing were great foothold and handhold spots.
The union was strong. Our only acceptable solution was to replace the housings with extremely rugged ones that wouldn't move, no matter how much weight was placed on them.
Part 2, same company. Don't let any salespeople near your demo equipment. Large industrial single arm robot, for auto assembly work. Able to place an engine in a car body. Demo had a car shell and an engine on the arm's gripper.
The salesman wanted to do more demos than the authorized one, so he hit the normal "show all degrees of arm motion" demo. While it was still holding the engine block. Of course, the car body was in the way, and got smashed, very loudly.
He kept his job.
bwmat@reddit
Damn, lucky the guy didn't hurt/kill anyone in the second case
For the first one, the union insisted purple be allowed to use it as a jungle gym???
mikeonh@reddit
First case - standard safety perimeter with sensors, along with sensor floor mat.
Second - didn't want to fight union, otherwise they'd be hostile from that point on. Not my descision.
space-to-bakersfield@reddit
At a previous job, we'd get transaction logs FTPed to one of our servers by a partner company from where our processes would pick them up and process them. At one point, bugs started getting reported with this process, and we found the cause to be random records having asterisks instead of a transaction ID.
We contacted the other company, and they insisted that they weren't putting those there. They reran the process but the new copies of the files also had the asterisks in them. There was more back and forth, including them sending screenshots of affected records that showed them to originally have had the actual IDs in them. Then someone noticed that all these IDs were 16 digits long and started with 4400, so they looked like visa card numbers.
That got us on the phone with our infrastructure department. They initially knew nothing about it, but after digging, they confirmed that there was some process put in place years ago on the server that opened up files and scrubbed all numbers that matched common CC number patterns. It had just existed there, probably long after the person who wrote it left the company, and only became an issue when this other company's transaction IDs got up to one of the ranges its regex was looking for.
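For illustration, a hypothetical scrubber of the kind described, and why it clobbers 16-digit transaction IDs that happen to start with 4:

```python
import re

# Naive "Visa-like" pattern: any standalone 16-digit run starting with 4.
CC_LIKE = re.compile(r"\b4\d{15}\b")

def scrub(line: str) -> str:
    # Replaces anything that *looks* like a card number, with no idea
    # whether the field actually is one.
    return CC_LIKE.sub("*" * 16, line)
```

Once the partner's transaction IDs crossed into a 4-prefixed range, every one of them matched.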
bwmat@reddit
I hate this kind of thing
Deleting bytes from files you don't understand the structure or use of because they 'look like' sensitive data is complete madness
We do this on our CI logs and is a pain in the ass, often censors information critical to debugging
gobuildit@reddit (OP)
Thanks for sharing! Props to the person who noticed the pattern, so easy to shrug it off as random.
seanrowens@reddit
I had a function that took some floating point arguments, latitude, longitude etc, and looped over them. The loop never exited. It made no sense at all, the loop was very simple. It should definitely exit sooner or later. It turns out that the function was being passed garbage due to a separate bug, very large values. Because of these very large values, the comparatively very tiny loop increment was lost to floating point round off. The loop counter never incremented and therefore the loop never exited.
bwmat@reddit
Honestly I've grown to fear any logic depending on floating point operations
seanrowens@reddit
Yeah I try to never use floats for looping now. If I need to loop over floats, I use integers and calculate the float inside the loop.
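The round-off effect is easy to demonstrate, along with the integer-counter fix described above (Python here; `sample_range` is an illustrative name):

```python
# With a large enough value, adding the step changes nothing, so a
# `while x < stop: x += step` loop never advances:
big = 1e17
assert big + 1.0 == big  # the increment is lost to round-off

def sample_range(start: float, stop: float, step: float) -> list[float]:
    # Loop over integers and derive each float inside the loop; the
    # counter always advances regardless of the operands' magnitude.
    n = round((stop - start) / step)
    return [start + i * step for i in range(n)]
```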
Nervous-Till4096@reddit
Samba, static memory OOM b/c someone had put a Map on an interface… Hardest one was where one out of every like 100 transactions through an API would take like 3 seconds instead of a hundred ms. Couldn't reproduce it against any node directly, so it turned out to be a bug in the F5 load balancer. Was working for Disney at the time, so they just complained and got F5 to patch the bug.
Schmittfried@reddit
I once debugged a JavaScript snippet for our custom scripting engine that just crashed at a line that was totally valid. Tried to find the bug in the internal C# classes under the hood. Nothing. Stepped into the IL that was generated by the JavaScript engine. Nothing. I finally bit the bullet and installed WinDbg to step into the native code that the IL was compiled into. Turns out the compiler actually had a bug (I don’t remember fully if it was a bug in the JavaScript engine or .Net framework, but I think it was .Net because I still remember I couldn’t really believe it) and generated opcodes that read from null pointers.
bwmat@reddit
I once spent a lot of time debugging some access violation crashes in some C++/CLI code
Eventually ended up w/ a time-travel-debugging trace which showed that code like
System::Object^ o = getValue();
if (o == nullptr) out_val.SetNull(true);
else {
    out_val.SetNull(false);
    const int32_t local_i32((System::Int32^)o);
    void* const valid_pointer_to_i32(out_val.GetBuffer());
    memcpy(valid_pointer_to_i32, &local_i32, 4);
}
crashing in the memcpy, due to the source of the memcpy being some unmapped virtual memory???
Luckily our company does pay for some support for Microsoft, and eventually they admitted this was caused by a 'memory hole in the garbage collector'
I tweaked the code a bit to make it do the same thing in a slightly different way, and that prevented whatever misguided optimization caused this (the GC would literally let the integer be garbage collected while the memcpy was occurring sometimes lol) and that fixed it
gobuildit@reddit (OP)
.Net is turning out to be very popular in this post.
pablosus86@reddit
I don't remember the details but accidentally making a property static led to sending several hundred railcars to the wrong places.
gobuildit@reddit (OP)
How long did it take to restore the damage done?
pablosus86@reddit
Thankfully they were sent to the right city, just the wrong place in the city and we found it before they got there.
IrishChappieOToole@reddit
I can remember working in a company with lots of microservices owned by different teams. We had CI/CD running where all services were first unit tested in isolation, then went onto an integration stage where they were all tested together. Unit tests had to pass to get to integration, and integration had to pass to build the docker images.
One late night, I was working on a super critical hotfix, and a service owned by another team was failing. I could see their code, but couldn't push. I can't remember the exact specifics, but it was something along the lines of subtracting one hour from the current time, and asserting that it was still the same date. So that test could never pass between the hours of midnight and 1AM. None of us who were still up could force the build, or fix the test. So we just had to twiddle our thumbs until 1AM when we could rerun the test
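The flaky assertion, reconstructed in Python (the exact original is lost; this is the shape of it). It can only fail between 00:00 and 01:00, precisely when someone is shipping a hotfix:

```python
from datetime import datetime, timedelta

def hour_ago_same_date(now: datetime) -> bool:
    # The unit test effectively asserted this is always True.
    return (now - timedelta(hours=1)).date() == now.date()
```

Injecting the clock as a parameter (or freezing it with a library like freezegun) makes the failure reproducible at any hour instead of a 1 AM surprise.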
gobuildit@reddit (OP)
Midnight to 1AM is nap time 😀 No code shall pass through!
joshocar@reddit
Joystick control system. The contractor poorly spec'd the analog/digital-to-serial board. It was designed for sensors, not user input control. You had to poll it for a reading, and if you polled it faster than 6 Hz it would send back partial messages and random ASCII. The analog message was a list of numbers for the state of the analog inputs, with no checksum. The digital side sent a decimal number which, when converted to binary, would represent the state of the 8 digital inputs. It also had no checksum.
During field tests the system randomly turned on some of the autopilot controls because of the junk data and almost caused a disaster. The contractor added in filtering for the bad data and added a change so that you had to hold down a digital button for 2 seconds before it would activate. Later in the year we found that randomly one of the autopilot systems would turn on during operations. This was a mystery as to why.
When testing the system we could not replicate the problem, it only showed up during operations and was very intermittent.
One key aspect was that the analog/digital signals got converted to serial and then sent through a serial network server that would convert them to network messages where the control system would read it. Polling commands made the reverse trip.
What was happening was that the serial network server was seeing a ton of network traffic from the rest of the vehicle and would buffer commands to the A2D and then send them in a burst. This only happened during operations because the network traffic was way higher than when we tested.
When the burst of traffic happened the A2D would send junk, which would get mostly filtered, but occasionally the filter would see a number and assume it was the numerical digital signal. This was fine because they had the 2 second press requirement before a control state would change. However, periodically, the system would happen to read a bunch of these fake digital state messages and one of the bits happened to be in the high state for a few of them in a row, resulting in the state changing.
The mitigation was to put the serial server on its own subnet. We later rebuilt the control boxes with appropriate hardware.
thx1138a@reddit
Some hospitality software which fell over if you opened a report while plugged into a projector. Not great for demos.
Also some client server software where, if a particular PC user opened a file, the Amdahl mainframe crashed.
Finally dotnet 10, which hangs when running under K8s because apparently having a process with a PID of 1 is a bad thing.
gobuildit@reddit (OP)
The dotnet one sounds hard to trace, how did you figure it out?
thx1138a@reddit
By leaving it to a talented colleague!
gobuildit@reddit (OP)
Classic!
sillyhatsonly764@reddit
I recall having the same problem with a perl script in docker once. Happened because I forked inside the docker container and pid 1 catches a signal when the process exits. If you don't listen for the signal then the processes stay as zombies. A noop handler prevented the outbreak. I think. It's been a while.
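A minimal sketch of that fix in Python (assuming a Unix host and a script running as PID 1): reap exited children from a SIGCHLD handler so they never linger as zombies.

```python
import os
import signal

def reap_children(signum, frame):
    # As PID 1 we inherit orphaned processes; exited children stay
    # zombies until someone calls waitpid on them.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children at all
        if pid == 0:
            return  # children exist, but none have exited yet

signal.signal(signal.SIGCHLD, reap_children)
```

In practice a tiny init like tini (or `docker run --init`) does this for you.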
hellotanjent@reddit
Worked for a PC game publisher in the early 2000s. Studio's strategy game was crashing randomly and we could never figure out why. Collected crash dumps for _months_.
Turns out one particular driver for one particular sound card running one particular set of sound effects would write random bytes into random areas of memory. We had to add a special patch to the sound engine to disable effects if we saw that configuration.
gobuildit@reddit (OP)
That's a tough one!
FlipperBumperKickout@reddit
A library we used would set up a singleton using a global variable instead of registering a singleton in the dependency injection framework. This fucked up a lot of unit tests, which each initiated their own instance of the dependencies but would screw each other over because the library used a true singleton.
bentreflection@reddit
Ran into an issue where a client’s password wasn’t working. Spent a lot of time trying to debug before discovering the client was in the unconscious habit of entering a space at the end of every word and was just incorrectly entering their password every time. No idea how this person functioned on a day to day basis
farzad_meow@reddit
we had users for a specific client log out randomly. very random behavior, other clients were not affected. we could not replicate it, no log or trace helped us.
i ended up getting a promotion to the devops team that gave more access to actual aws accounts, that is when i found out that it was a cache eviction problem due to that client having a process that needed a ton of cache. a simple cache upsize solved a two year problem.
another one i had to deal with had to do with a vaguely defined business rule. we had to go back to the team and ask them to clarify the rule so we knew how to code for the edge case. took a month to get approval from legal. in the meantime we made the edge case throw an error and only allowed admins to do it.
Smok3dSalmon@reddit
On a python webapp, some function had an optional timestamp parameter and it was defaulted to now()
It passed the eye test and so nobody used the optional ts parameter and went with the default, the timestamps looked ok in testing too.
We found out that the default value was assigned when the application started. So everything had the same ts when we started integration testing.
Fix was to move ts = now() into the function
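The bug in miniature: Python evaluates a default argument expression once, at function definition time, not on each call.

```python
from datetime import datetime

def log_event_buggy(name, ts=datetime.now()):
    # `datetime.now()` ran once, when the module was imported.
    return name, ts

def log_event_fixed(name, ts=None):
    if ts is None:
        ts = datetime.now()  # evaluated on every call
    return name, ts
```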
gobuildit@reddit (OP)
Re bug#1 - The python default arguments are counterintuitive and can cause unexpected bugs.
FlipJanson@reddit
Kept having issues where our server wasn't receiving requests from a customer's application and we suspected it was a customer's firewall (naturally, they denied it because it's never the firewall /s). Finally it took us and their dev, network, and firewall teams on a call doing packet captures on every device in play before we finally proved the firewall was dropping packets - revealed by a file named "dropped-packets.pcap" LOL. Turns out their application was keeping a port open too long and their firewall closed it from being used.
Rubysz@reddit
Recently I joined a startup that works on construction robots. Now I have literal corner cases 🫠
ShoePillow@reddit
What do you do when you hit a wall
NGTTwo@reddit
Or fall down the rabbit hole.
WanderingSimpleFish@reddit
Construct lighthouses, problem solved
ShoePillow@reddit
Or the death star
liyayaya@reddit
We have a .Net service that monitors production machines and pushes records to multiple MQTT consumers. At the same time, we persist these records into an Oracle database using EF Core and a logfile on the server running the service.
The service usually runs fine for weeks or months, until we randomly encounter a bug with timestamps in the Oracle database.
Once this bug appears, all new records written to the DB start getting almost the exact same timestamp.
Example after the bug appears:
- We receive an event with timestamp 5 AM
- Logs show event timestamp as 5 AM
- MQTT messages contain the correct timestamp
- DB entries have 1 AM and a few seconds
Basically, all follow-up events after the bug first appears will have this timestamp of 1 AM, differing only by a few seconds, no matter when we write to the DB. In logs and mqtt messages the timestamp is always correct.
The method to write to the DB is basically like this:
public void WriteToDb(MyEvent evt)
{
    var eventRecord = new EventRecord(); // set fields
    db_context.Add(eventRecord);
    db_context.SaveChanges();
}
We added logging at each step, and the timestamp in the event record is fine until after SaveChanges().
After that, the timestamp in the record is also messed up.
We still don’t know what is causing this. We opted for a nightly service restart. Since then, we have not encountered the bug, but I’d much rather fix the root cause.
If someone knows wtf can cause this, please help.
PsychologicalCell928@reddit
Here's another one:
We wrote a Videotex system that ran on DEC PDP-11's and then subsequently on DEC VAX equipment.
We used one machine as a dedicated database machine.
Periodically when load was high a query from the database would come back as incomplete but only when it was running on one of the machines.
There would be no indication of error. However the end user application would report a slightly incorrect result.
We tested the query running on each machine by logging onto it directly. Everything worked perfectly.
Finally after many different attempts at diagnosis failed we pulled up the data center tiles.
And there, connecting the suspect machine, was a very long, coiled internet cable. We took it out and measured it and it was about 10% longer than the maximum allowable cable length. The installer either mismeasured or decided that the cable was close enough in length.
Replaced the cable with one that met the specifications and the problem went away.
midasgoldentouch@reddit
Oh wow - did the extra length mean a longer transmission time and therefore a higher likelihood of getting a bit error or something like that?
PsychologicalCell928@reddit
Yes the transmission time was longer but also there was a bug in the vendors software.
The software would set a timeout right before the query was sent.
It would clear and reset the timeout upon successful reception of a record.
The extra length opened up a small window where the first bytes of the record arrived simultaneously with the timer expiring.
It was a race condition that shouldn’t have existed because the transmission time on a cable that met specification would never have exceeded the timeout.
midasgoldentouch@reddit
Oh wow, it’s like double trouble. Did you end up tweaking the timeout as well?
UncurableZero@reddit
Recently assembled a PC and started getting corrupted TOAST data in PostgreSQL (the out-of-line storage Postgres uses for oversized column values). Everything worked fine but this, until I realized I had set up the wrong RAM frequency profile in BIOS.
A few years ago I worked at a company where separate projects ran on a single on-prem VM cluster. Under some circumstances, overload of one service's VM caused some other random services to fail. (noisy neighbors :))
gobuildit@reddit (OP)
Thank you for sharing! How did you debug the postgres issue?
UncurableZero@reddit
Started suspecting the RAM, ran a memory test through the whole memory space and saw it fail.
raserei0408@reddit
Years ago, I worked on an HTTP API that required users to get a session token to pass as a cookie to all future requests. The session token timed out after 30 minutes, enforced on the server side and with an expiration time set to automatically remove it on the client side.
We got a report from a customer that their script using the API entered a loop of requesting a new session token, making a request, and getting a response that the request did not have a valid session token. This lasted for 30 minutes, then returned to normal. This happened just before the "fall back" time change in November. Okay, great, time change bug. But reproduction attempts failed except when using their exact script.
Ultimately, the problem was a bug in the Microsoft .NET library used by the client script for handling cookies - it would convert the expiration time epoch timestamp to a local time in a time-zone-aware way, but compare to the current time in a non-time-zone-aware way, so it would interpret a time 30 minutes in the future as being 30 minutes in the past and immediately expire the session token locally.
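A Python sketch of the same class of bug (illustrative; the actual defect was in the .NET cookie handling): the expiry is interpreted as local wall-clock time while "now" is naive UTC, so every comparison is off by the machine's UTC offset.

```python
from datetime import datetime, timezone

def is_expired_buggy(expiry_epoch: float) -> bool:
    # BUG: fromtimestamp() yields naive *local* time, utcnow() yields
    # naive *UTC*; comparing them is off by the local UTC offset.
    expiry_local = datetime.fromtimestamp(expiry_epoch)
    return datetime.utcnow() >= expiry_local

def is_expired_fixed(expiry_epoch: float) -> bool:
    # Compare timezone-aware datetimes in one zone.
    expiry = datetime.fromtimestamp(expiry_epoch, tz=timezone.utc)
    return datetime.now(timezone.utc) >= expiry
```

West of UTC, the buggy version declares a token 30 minutes in the future already expired, which produces exactly the new-token/reject loop described above.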
gobuildit@reddit (OP)
Framework bugs are so tough to track down! Thank you for sharing.
mikeonh@reddit
Be aware it could actually be a hardware problem!
Worked on a medical device using a 680x0 microprocessor. Very rare, severe failure, which is not acceptable in medical devices.
Turned out to be an interrupt, pushing context onto the stack, causing the stack to cross a page boundary and also trigger a page fault. Context on the stack was corrupted! We did feed it back to Motorola.
I've also seen multiple intermittent ram failures when not using parity / ECC, and issues with caching writes to disks that never actually got written.
Always, I mean always, use server grade hardware. I've worked with too many cheap startups that tried to get away with consumer hardware in production.
Finally, I've seen so many off-by-ones in storage allocation, use before initialization, use after free, and race conditions without proper locking - at least some of the newer languages help mitigate the older bugs.
Please have a subject matter expert as part of your design team - someone who actually knows how the customer is going to use the software. A bunch of junior / mid-level engineers creating stories does not substitute for actual experience.
Worked for a company that was developing a tactical handheld radio for military use. Ruggedized Ethernet port *on the bottom of the case*, right where someone would set it down into the sand. Had a blackout mode for all of the indicator lights - useful when operating at night in a contested area. However, when booting, the hardware *flashed all of the lights*, then it checked for the blackout mode state. The team didn't think it was a big deal.
Too many other stories from 57 years of experience for this post :--) Retired now, and thanks for the interesting question.
gobuildit@reddit (OP)
Thank you for sharing!
reboog711@reddit
Edge cases are so specific to domains I worked in and products I've helped build, that I cannot imagine anything I could share that would be useful without a ton of context, or internal IP.
I once worked on a shopping cart system, where the invoice tables referenced the product tables for price; so if a price ever changed, so did the order history. I'd argue this should have been foreseen by the original architect and is not an edge case.
midasgoldentouch@reddit
😩 Yes, this is why we always create new prices and archive the old ones!
overgenji@reddit
this one pool would just start 500'ing every day in response to traffic spikes around the same time a couple times a week (the spike was expected, user notification inbox kind of thing). bug tickets were filed, people took cracks at it here and there, but it just sorta got left alone because the metrics were confusing: there were rate limits but they weren't really firing, etc. the workload wasn't scaling up dynamically because the only real dimension we could scale on at the time was CPU load, but CPU load would actually start dropping at a certain saturation threshold, so it'd actually scale DOWN! load balancer logs/metrics didn't reveal anything surprising: just failing to find a backend to call, etc.
clientside sdk issue? CDN not hydrating correctly? whats going on with the CPU?
finally im poking around deeper on the system level metrics and notice the file descriptor counters are through the roof during the degradation -- turns out the pods were spending their time being i/o bound and degrading from constantly switching between threads.
the max threadcount in the configuration was set incorrectly and it was actually unbounded. i picked a reasonable limit for a service like this (~2k or so) and now the service would both scale and also more requests could actually succeed and the CDN would hydrate faster and so on.
there was still a traffic spike from actual users, but the clientside retries would actually succeed and the overall size of the spike the backend saw got much smaller and was much shorter as a result, and the cpu was actually utilized so some scaling could occur. we adjusted our alerts to not page us for this kind of spike and i got a pat on the back for solving a longstanding issue lol
LiveMaI@reddit
I had some code that worked in our lab, but would crash when we deployed it to our CM’s machines in the factory. It was some issue with character encoding and Pandas failed to read some CSV file because of the way that Windows handles character encoding when the system language is set to Korean. We had to have our CM change the language settings on their machines to use the experimental Unicode support in Windows to work around it.
cbunn81@reddit
We had a bug in a web app once that was only manifesting for a user of the Japanese version of Windows running Edge with auto-translate turned on.
This made no sense at first, because we localize the app for Japan. So why was it trying to translate the Japanese text into Japanese? It turns out the Next.js boilerplate HTML includes the line <html lang="en">. Apparently, the translation function of Edge was seeing that and trying to translate the text.
reddit_man_6969@reddit
Frontend. Had a page where the page-level Loading component wasn’t displaying on slow connections.
Turns out the web browser was waiting to render the component until the page had fully loaded, kind of negating the point of that component.
Why was the browser doing this? Because the component was returning less than 200 characters worth of content and the browser figured it was being helpful.
Solution was to add a hidden line of text (we inserted an Easter egg lol). Weird
shared_ptr@reddit
I was chatting to a friend recently about database migrations and how you need to be hyper aware of when you step outside of a safe zone, almost more so when you’ve invested in making things really safe.
Specifically: an incident comes to mind where our primary Postgres cluster locked up because an update table, written through database triggers (this was before logical replication existed), hit the max integer value on its primary key.
This sounds really obvious and that’s because it is! We had already gone through all the standard incidents for database migration changes and had written our own framework to produce safe migrations, ensuring we’d never do something as silly as creating a 32 bit primary key.
The reason this had happened was because the change capture system was written separately from the main application, which meant it existed outside the normal developer flow of modifying the database. When your team have got used to database migrations being default easy and uncomplicated it leaves blind spots if they ever step outside that process, and the team who built this system hadn’t even clocked they were doing it. It was just a new system and written in another language for good reasons, and didn’t catch that outside of the tools we’d already built, migrations in a database like this were very high consequence.
Has made me intensely aware of the safe paths to making changes in an engineering org and I keep an eye out for whenever anyone steps outside those zones nowadays.
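A back-of-the-envelope check worth running against any busy table (hypothetical helper, not the framework from the story):

```python
INT32_MAX = 2**31 - 1  # ceiling of a 32-bit signed serial primary key

def days_until_id_exhaustion(current_max_id: int, inserts_per_day: int) -> float:
    """Roughly how long until the key space runs out at the current rate."""
    return (INT32_MAX - current_max_id) / inserts_per_day
```

At a million trigger-driven writes a day, a fresh int32 key lasts under six years, easily within the lifetime of a "temporary" change-capture table.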
Careful_Ad_9077@reddit
First, the anti-example (to establish rarity): I once had to convince another dev to ask for an extra day to fix a bug in his code; the bug could get fixed by tech support, but it would take them between 30 and 120 minutes to run the queries.
The dev refused to do the extra work because that bug only happened 1% of the time; 99% of the time the code would run fine.
How did I convince him to ask for extra time? When we reached this point in the discussion, I reminded him that while QA will only run the code once or thrice to test it, the client would run it 300 to 500 times every day, thus creating 3 to 5 hours of work every day for support. Not that he cared about support, but he used those numbers to convince his manager to give him the extra time.
Now, about an elusive edge case. I don't remember the specifics, but what was common to them was code that would run perfectly fine because the real-world case had low to no concurrency, yet would break horribly if something like a power outage happened right in the 5 ms the code was running.
The fun part is that this has happened to me (my teams) like 3-5 times in my career.
Fragrant-Brilliant52@reddit
Worked for a small news outlet. The data was pretty messy, so we had to rely heavily on tags from publishers. One of the tags was a level of “extremity” on a scale. I had a request to show more or less content depending on the user’s location, meaning some users might be shown more “extreme” content based on their IP origin.
ThirdWaveCat@reddit
Dealing with data-intensive systems, like state machine replication from an assortment of remote sensor data with intermittent connectivity, is challenging when you want high assurances that data loss is avoided. Property testing tools like 'hypothesis' or 'proptest' can be used to abstract out network buffers into unit tests that adversarially try to break some invariant you define, like majority replication or the existence of a single history of data logs (linearizability).
high_throughput@reddit
I didn't personally encounter it, but the DropAllTables call in the database library (normally used only by test frameworks) now takes a dummy constant that needs to match a hard coded value.
This was because the message type tag was a single freak cosmic ray bit flip away from a common operation, which is believed to be the root cause of a giant, dramatic failure after years and years of smooth operation.
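The mitigation pattern, sketched in Python with hypothetical names: a destructive call must be accompanied by a hard-coded confirmation constant, so no single corrupted bit in a message tag can select it by accident.

```python
# Arbitrary constant, many bit flips away from small enum-like tags.
DROP_ALL_TABLES_TOKEN = 0x5AD0_C0DE

def drop_all_tables(confirm_token: int) -> str:
    if confirm_token != DROP_ALL_TABLES_TOKEN:
        raise PermissionError("refusing: confirmation token mismatch")
    return "all tables dropped"  # the destructive work would go here
```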