How Software Engineers Make Productive Decisions (without slowing the team down)
Posted by strategizeyourcareer@reddit | programming | View on Reddit | 31 comments
BigHandLittleSlap@reddit
This kind of advice is great... if you have a large team working on a single product with sufficient usage that the metric curves are smooooth. Hence, any "dip" or deviation is a reliable signal of something and can be alerted on, investigated, or whatever.
Similarly, A/B testing, staged rollouts, per-user feature flags, etc... work a heck of a lot better if 5% of the user base is more than like.. one or two people.
In a 30-year career, I've only had the pleasure of working on such a "simple" system once. Once!
Everywhere else, for LoB apps with a couple of hundred users, of which maybe a few dozen log in per month, this advice just doesn't work.
The sad thing is that all of the large vendors like Amazon, Microsoft, etc... know nothing but the millions-or-even-billions-of-users scale. They can't even conceive of the small to medium (or even large!) businesses that have specialised bespoke software serving a subset of some small internal department.
The tooling doesn't work. The advice falls flat. The load balancer pings and the security testing tools represent 99% of the requests logged. The signal is lost in the noise.
frnxt@reddit
Working in relatively niche industrial settings, I have never in a 15-year career seen an app with more than a couple hundred, maybe a thousand users, so that definitely matches your experience. And issues can last for years before they are discovered: one of our customers recently found an issue upon upgrading... and it turns out that, under some conditions, the issue had been 100% reliably reproducible for at least 5-6 releases.
pohart@reddit
I've got about 300 users/week and 200/day on an app that's been live for 20 years. We've had thousands but not tens of thousands of unique users.
Got a user bug report in August for a bug we'd never seen that looks to have been part of the initial release. There's a module reachable from two different paths, and one of them only worked under very specific conditions that happen to match how they've been using it.
Sigmatics@reddit
This one happens very often in my experience. Internal tools just end up over-optimizing for the specific environment they're operating in, because why not.
When that environment eventually changes, or the tool is used in a slightly changed environment, things break.
pohart@reddit
Yup. And for all I know users have been training each other not to do it that way this whole time and 99% of them just know that's how it works.
Maxion@reddit
Heck, in some of our tools we have known bugs in production that just aren't issues because we can control the business processes. We will know in advance when the business process changes, so we can then validate the new usage of the app.
lookmeat@reddit
You are confusing two separate issues.
What you say is true for automatic detection. In small systems you work by checking on everyone manually and making sure they can call you.
You push a change, then a couple of hours later you get an angry call from a single customer that represents 70% of your company's income: you broke them. You check, and they're right: what you thought was a fluke was actually a big problem starting. The customer has lost ~half a million by now, and is losing about half a million for every hour their system isn't working because of the problem in yours.
Now what's the better scenario here? Flip a flag and call it a day? Or: make a PR that undoes the change (if you're lucky you know which PR to roll back; if you're really lucky you just flip a config/variable in the code somewhere, but that's exactly the advice you say doesn't apply here). You then force-push the PR and cut an emergency release (as oncall you get to break the glass; lucky you that you were oncall when you pushed your PR, otherwise you'd have lost precious time coordinating with another engineer, or worse, debugging code you're unfamiliar with or scrambling for permissions and support to push the fix). Finally the release gets rolled out aggressively. That whole thing could easily take an hour.

Meanwhile, with a feature flag you just flip it off everywhere. Better yet, you press a big red button and all the most recent changes are undone, no need to write a fix first.
Next time, you use feature flags. Not because you want A/B samples, but because you want to first send a change to everyone except the whale and go from there. And if you see an issue, you undo it quickly. Hell, you realize that your own company is a big user of the code, so you first release only to internal users: congrats, you've built a poor man's canary (a rough sketch of that kind of flag check is below).
The large companies you mention, with systems that have enough data for smooth metric curves, are where automated detection shines. There the problem is not that different, except now you lose $500k every ten seconds instead of every hour. That justifies investing work into reacting even 5 seconds earlier.
But let's be clear: you still want an easy way to undo any change you make, because it's really painful when you fuck up. Smaller products have less leeway to fuck up.
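A minimal sketch of the kind of hand-rolled flag check described above, assuming a hypothetical JSON config file (`feature_flags.json`) rather than any particular flag service; the flag and customer names are made up:

```python
# Sketch: a feature flag with an exclusion list and an internal-only mode,
# read from a hypothetical JSON file -- not any particular flag service.
import json

FLAGS_PATH = "feature_flags.json"
# Example file contents:
# {"new_billing_path": {"enabled": true,
#                       "excluded_customers": ["whale_corp"],
#                       "internal_only": false}}

def load_flags(path: str = FLAGS_PATH) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}  # no flag file: treat every flag as off (fail safe)

def is_enabled(flag_name: str, customer_id: str, is_internal_user: bool) -> bool:
    flag = load_flags().get(flag_name, {})
    if not flag.get("enabled", False):
        return False   # the "big red button": flip "enabled" to false in the config
    if customer_id in flag.get("excluded_customers", []):
        return False   # keep the whale on the old, known-good path
    if flag.get("internal_only", False) and not is_internal_user:
        return False   # poor man's canary: dogfood internally first
    return True

# Usage: branch on the flag instead of shipping the change unconditionally.
if is_enabled("new_billing_path", customer_id="acme", is_internal_user=False):
    pass  # new code path
else:
    pass  # old code path
```

The point of the design: turning the change off is a config edit, not a revert, a force-push, and an emergency release.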
Relative-Scholar-147@reddit
I just started working on an app that produces 1 GB of logs per day: microservices, RabbitMQ, etc. It's a webapp that's going to be used by 5 guys a day, max.
Markavian@reddit
Cold read: I suspect most of those logs can be converted to metrics; and any additional or interesting log state would be better stored as progress state in a database.
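For illustration, a rough sketch of the logs-to-metrics idea using Prometheus-style counters via the `prometheus_client` library; the service name and port here are made up, and the commenter's actual stack may well differ:

```python
# Sketch: count health-check pings as a metric instead of logging each one.
from prometheus_client import Counter, start_http_server

HEALTH_CHECKS = Counter(
    "health_checks",
    "Health check requests received, by service",
    ["service"],
)

def handle_health_check(service_name: str) -> str:
    # One counter increment instead of one log line per ping:
    # the volume problem disappears, and the rate is still queryable.
    HEALTH_CHECKS.labels(service=service_name).inc()
    return "ok"

if __name__ == "__main__":
    start_http_server(9100)        # expose /metrics for the scraper
    handle_health_check("orders")  # hypothetical service name
```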
Relative-Scholar-147@reddit
The logs are the microservices making health checks. 1 GB per day. But hey microservices!
JorgJorgJorg@reddit
log at DEBUG and only enable the debug level when needed
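A small sketch of that suggestion using Python's stdlib `logging`, with an environment-variable toggle (`LOG_LEVEL` is an invented name here, not a standard):

```python
# Sketch: routine pings at DEBUG, real problems at WARNING and above.
import logging
import os

logging.basicConfig(level=os.environ.get("LOG_LEVEL", "INFO").upper())
logger = logging.getLogger("healthcheck")

def handle_health_check() -> str:
    # Invisible at the default INFO level; set LOG_LEVEL=DEBUG to see these.
    logger.debug("health check ping received")
    return "ok"

def handle_failed_dependency(name: str) -> None:
    # Always visible, regardless of the toggle.
    logger.warning("dependency %s failed health check", name)
```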
Relative-Scholar-147@reddit
I have a question:
Did I ask for any advice about basic logging, or do you just always think you know better than others?
nonsense1989@reddit
Who the hell pisses on your cereal? Did you get personally called out for wasting time at retro or something?
esperind@reddit
It's the response of someone who has already been asked many times why his logs are so big.
nonsense1989@reddit
Yea, skill issues. Read his first comment: 5 users, 1 GB of logs per day.
Jesus fucking christ
lolimouto_enjoyer@reddit
Rookie numbers; one of our teams hit 100 GB a day with no users at all.
nonsense1989@reddit
Dude literally has more microservices than users.
non3type@reddit
It’s only natural to want to help someone clearly out of their depth.
chucker23n@reddit
That escalated quickly.
itsa_me_@reddit
Yeesh
JorgJorgJorg@reddit
someone else reading may learn something. Sorry about your particular situation and codebase.
Relative-Scholar-147@reddit
Sorry, I thought this was programmingcirclejerk.
Stasdo12@reddit
thx 🙏
QuineQuest@reddit
Upvote button didn't work?
strategizeyourcareer@reddit (OP)
I hope it's useful!
MMetalRain@reddit
I think the problem is often the other way around: thinking you need reversibility when it's much faster and cleaner to just make the irreversible change.
FlashyResist5@reddit
Does no one proofread anymore?

"I'd slow down on purpose: rehearsal in non-prod environment"
nerd5code@reddit
Oh, go on, slow them down.
ConscientiousPath@reddit
The problem with so-called "reversible" decisions is that they are often made irreversible by later, unexpected decisions.
Luckily 98% of what you want to do has been done before, so the better way to make decisions is just to look at how others have done it and then check whether they still thought it was a good idea afterwards.
JollyRecognition787@reddit
The illustrations make me sad.