Blameless Culture in Software Engineering
Posted by thehustlingengineer@reddit | programming | View on Reddit | 90 comments
Chance-Plantain8314@reddit
We do this. It works in the 85th percentile. All "we", never "I". Fault Slippage is always "the team" and never "Bob" even if Bob really did fuck up - because ultimately there should be code reviewers and test loops between Bob and the customer.
It does, however, make accountability a nightmare if you don't have a good manager. I've had both sides of the coin and sometimes when Bob can't stop fucking up, he's still never held accountable.
Salamok@reddit
In my experience mediocre and below managers don't ever try to get rid of anyone unless it's personal. One of a manager's KPIs is how many people they manage, so their excuse for a non-performer will usually be "we don't have enough resources, I need more people."
pinkjello@reddit
So, I manage about 100 people in a F100 company that does stack ranking. Stack ranking gets a bad rap, and I hate it too but have no choice.
But it is a decent forcing function to avoid things like this. I am always looking for my lowest performers and those of my peers. People who aren’t even trying (or are truly incompetent). I shield people who make mistakes (we all do) and learn. But if you’re dead weight, even if I like your personality, GTFO of here. The rest of us are trying to build things and make them better, and it’s demoralizing to have freeloaders around.
Also, even if you’re stacked at the bottom, there are ways to come back if you try. It’s not a lost cause.
Salamok@reddit
There are so many different implementations of it that you can't really pass judgment on it as a whole, but there are for sure really bad implementations as well as good ones. There are situations where management for whatever reason uses it as a tool to limit seniority, and that just seems like a horrid environment. Then there are places that are huge that have done it for decades, and you wonder at some point if they hit a peak and are running out of new hires that are better than folks they eliminated years ago (looking at you, Amazon). It can also be a really shitty way to ensure all your tribal knowledge makes it into the documentation; after all, you gotta make sure the constant new folks onboarding get up to speed asap. But at some level you would think you would want to empower your managers to go to bat for their team and justify no churn for the current round, even if doing so was not the path of least resistance.
I have for sure managed teams where I wished I was given the excuse to easily remove a few folks but I have also been in situations where I felt wow this team is really working well together hope nothing fucks it up and we can keep this going.
domrepp@reddit
Yeah, no. I've also managed big teams in large companies, and when organizations rely on stack ranking it just tells me that leadership doesn't know what success looks like.
If you need to pit your team against each other to weed out the low performers, then you're failing as a leader to define for your team what success and failure looks like with clear, measurable terms. The only thing that stack ranking adds is a culture of insecurity that turns teammates against each other during rough times.
aanzeijar@reddit
The point isn't to shield Bob from consequences.
Every time something happens, I fight tooth and nail to make sure we first figure out the way forward and how to fix it, because human nature seems to gravitate toward finger pointing.
I don't care who did it, I care about where to go from there. I'm perfectly capable of using git blame to see who committed it, I still don't care. Hell I've sat in the same room with the only guy who has access and set up the thing that just broke in the exact way I told him it would break when he built it.
Still not interested in blaming before it's fixed and it's made sure that it doesn't break the same way again.
Afterwards you still can have a long talk about whether the guy should maybe get his access restricted.
Sigmatics@reddit
You have a point about first fixing then finding the cause. But if it's one person repeatedly causing issues, you have a problem
Familiar-Level-261@reddit
two problems.
The person might be a problem on their own, but the second problem is the system that allowed the repeated fuckups to filter through to production
anti-state-pro-labor@reddit
This exactly. The problem is a system problem first and foremost. Why does the system let Bob fuck up without any feedback before it hits a customer? Why does the system not alert us it's a problem before the customers notice? Why doesn't the system help Bob not fuck up?
Yes, fire Bob if they keep fucking up, sure. And any manager should be able to figure out Bob is the shared problem across all the issues the team is facing. But that doesn't mean the system isn't the root cause of the customer facing problems. Postmortems should blame the system, 1:1s should find out how the human parts of the system can be better.
Inevitable-Plan-7604@reddit
There's a limit to what you can do, especially in small teams/companies. It's easy to say "change the system to introduce a QA department, a product department, UAT guidelines, smoke testing, alpha testing", etc. At some point, it's part of Bob's job to learn. And when he doesn't there's no one else to blame but him.
Blaming the system just makes Bob cost even more to the company, especially if he's the only one repeatedly fucking stuff up
anti-state-pro-labor@reddit
Then fire Bob. I'm not against that at all. I just don't think the postmortem is the place to do that. I've never been a part of a team where during the postmortem we didn't find something actionable that we could do to make our system more robust. Yes, Bob sucks and we tell the manager that directly during a 1:1. I just don't see the value in telling everyone Bob sucks during the postmortem.
And if you have a hiring pipeline that continually hires Bobs, you have a non-engineering system that needs to be blamed. Which again, isn't the fault of John in HR or the hiring manager. It's a system problem and we can fix the system.
Inevitable-Plan-7604@reddit
Fair enough, we're on the same page. No, publicly shaming Bob isn't going to achieve anything.
barrows_arctic@reddit
Three problems, and the third one is the most severe: how did Bob get hired in the first place?
Getting rid of a troublemaker is significantly more difficult and costly than simply never hiring them at all.
Familiar-Level-261@reddit
Eh, hiring is complex and you can't 100% judge a candidate in the hiring process.
Also some people might not be bad technically and so pass even a good hiring filter, but not have the work ethic to stop themselves from pushing barely tested stuff.
munchbunny@reddit
This is absolutely true. Sometimes there really is a competence/judgement/accountability problem for an individual on the team. It’s the manager’s job to manage the distinction. You run a blameless postmortem, but if one person has a pattern of messing up, you address it privately with them and one of their goals becomes “practice the set of behaviors that help you make fewer mistakes”.
I’ve had the pleasure of running a fairly high accountability team for a few years, and the ones who take accountability don’t need blame to understand how they messed up and what they want to do to reduce their own errors, and when they say “this system is too easy to mess up” I can generally trust that they are right.
I’ve also seen the opposite, people who try to take advantage of the “blame the system not the person” dynamic to deflect personal accountability. That’s not a reason to stop doing postmortems blamelessly, but as a manager you have to have the hard conversation with the person, such as “you need to pay more attention to best practices, before you do X you need to send me your plan for how to make sure you didn’t break Y, and if you do it on Friday afternoon you need to be ready to spend your weekend fixing it.”
nnomae@reddit
One of my pet peeves is managers who won't call out the person making the mistake. I still remember a meeting where a manager was going "some people are leaving work early" and we all knew who that person was, "some people aren't updating documentation" and we all knew who that was, "some people are arriving in late" and we all knew who it was and so on. Had he just taken each individual aside and pointed out the one thing they were doing wrong they'd have been fine, instead he annoyed everyone by blaming them all for a half dozen things they weren't doing.
thehustlingengineer@reddit (OP)
Absolutely, it is a team sport. I think it is important to learn from mistakes and not repeat them. Repeating the same pattern of mistakes is definitely a red flag
Niewinnny@reddit
the first time something is fucked up it's just a mistake.
Subsequent times that the same fuck up is not found is on the system. Anyone and everyone makes mistakes, that's why there are peer reviews and thorough testing to make sure no fuckups go through to prod. New fuckups are fine to be made once because you might not have had the time to implement shit.
And subsequent fuckups that do get found are on the person who makes them because why the hell are you making the same mistake for the 5th time.
Inevitable-Plan-7604@reddit
That's not necessarily true. We can't keep removing components from Bob's job because he's shit at them, until he's left with 0% responsibility and 100% pay.
"The system" includes "employing underperformers" and needs to be adjusted alongside every other lever and mechanism inside a team.
baron_von_noseboop@reddit
The "system" also decides who is on the team, what work is assigned to them, and chooses how to measure and reward individual contributions. So repeated individual failures are also still a sign of systemic failure. It wasn't just the individual who screwed up.
Chance-Plantain8314@reddit
There is and always will be shared blame but ultimately a person who repeatedly makes the same mistakes out of laziness and an unwillingness to learn needs to be addressed, whether with support or with accountability. If a fault slips through the system, the system needs to fix it, but if it's the 5th instance of the same developer making the same silly mistake, they have a share of the blame too and that has to be addressed.
campbellm@reddit
Classic Bob.
chucker23n@reddit
Yeah, but at that point, no replacing of individual teammates is going to fix the problem.
Chance-Plantain8314@reddit
Eh, I'm with you and against you on that one. When you're in an EU-based software company, job security is high. This is good obviously. But I've been in situations where we're stuck with a nightmare developer, the team is full, and it means we're not getting anyone else instead of them.
Replacing the individual can certainly fix the issue if that person takes accountability and cares about what they're doing.
Though I fully agree with you systemically - you could easily be assigned someone the same or worse. It's a dice roll.
chucker23n@reddit
I'm not saying bad teammates don't happen. They do.
I'm saying if the supervisor doesn't recognize them as a problem, give them an opportunity to improve, and ultimately show willingness to kick them out, then the teammate isn't the problem; management is.
Chance-Plantain8314@reddit
Ah - absolutely agreed, and exactly the point I'm making: the whole system of a blameless culture hinges on that management.
CherryLongjump1989@reddit
EU can and does fire people, it's just that managers are lazy or out of touch and don't want to put in the effort in making sure that this happens in a fair and legal way.
rzwitserloot@reddit
Different layers.
When you're in a team meeting the aim of that meeting is to 'move forward': To ensure folks aren't just sitting there meekly receiving commands, but will say something if they feel there's room for improvement or spotted a potential bug. To keep everybody motivated, and to get the problem of the day fixed as best as you can (well, and quickly). That sort of thing.
Chewing out somebody who's had a bad week is a fucking terrible way to accomplish any of those goals.
When you're sitting down in person and are doing a performance review, which you should probably do twice a year (in various EU countries this is essentially mandated; it is already difficult to fire people, and if you don't do this, it's impossible), that is the moment. These talks are (should be) documented and signed by both parties. This is where you raise the issue that Bob can't stop fucking up: In a 1-on-1 with Bob (Bob + Bob's manager and nobody else. That manager should know a lot about Bob's job: It's Bob's team lead. Not an HR person).
That does mean somebody is responsible for tracking Bob's fuckups. But that's inherent to this job. Because the alternative is that everybody just says "Well, this one is on Bob" whenever the vibe strikes them, i.e. that the entire team is responsible for tracking this and that it reflects on Bob's personal record once somebody decides they vaguely recall the team blaming Bob rather often.
See, now that I wrote out how that works surely you realize that's an utterly ridiculous way to do it.
Chance-Plantain8314@reddit
Well obviously, what you're saying is the entire point of blameless culture. But your example of why it has to be that way is just the complete opposite extreme. A totally blameless culture DOES have issues with accountability, that's the case by nature. That gap is filled if you have a good manager whose job it is to recognize a significant weak point on the team when it's having a detrimental impact on the rest of the team. That manager's job is to support Bob and rectify the situation out of the public eye.
If you don't have a good manager, they aren't doing that. They're either chewing Bob out and impacting the culture and defeating the purpose of the blameless approach, or they're refusing to hold any accountability to the extreme, which means that Bob maintains no accountability continually to the detriment of the team, and also never gets the help he needs.
The point is that the system doesn't have to be one way or the other to the extreme. The entire point is that Blameless culture requires a good manager committed to the system or else the entire system falls apart.
Ultimately that layer, the manager, is the be all/end all because otherwise that culture's going to decay either from resentment within the team or a lack of speak up culture.
deathhead_68@reddit
Yes, some managers are terrible at knowing who is good and bad at different things on the team.
CherryLongjump1989@reddit
Which is why "blameless culture" can be a cover for incompetent management, but that's not a good thing. Managers need to be held accountable.
pxm7@reddit
It sort of also depends on how Bob fucked up. If Bob accidentally deleted a table in production, then it’s not really a Bob problem, the real problem is a few layers above Bob.
“Bob wrote bad code and review didn’t catch it” is harder to pin down — as you said, 85th percentile, and people have a way of fucking up in new and creative ways. But if it happens often, I’d be trying to understand why. Including how busy the reviewers are, and what is eating into their time, and how improved testing could help.
BrawDev@reddit
Man, I worked with a dude that did nothing for an entire year and the manager was nothing but supportive of him, and he just quit after a year to found his own business. Highly sus he just worked on his app while getting paid.
End of the day, it was the rest of us that had to pick up his slack.
sneak2293@reddit
I hate it. Blameless culture ends up blaming the wrong person
kintotal@reddit
When I was a manager, I always preached never to fall prey to the Fundamental Attribution Error. We always looked to external, situational causes for failure. This produced a positive culture with less fear, less conflict, and happier people. That said, my job as a manager was to deal with those who weren't a good fit for their role. Situations where changes needed to be made were always difficult and required good HR practices to ensure success. Having a good culture and appropriate management are not mutually exclusive.
sidneyc@reddit
Those are some pretty bold statements.
diMario@reddit
From the article:
Agree in principle. If on the other hand it is always the same person causing the problems (Chad, Kevin, Ashleigh) then you might want to do something about it (and if management is unwilling to engage in confrontation, well, draw your own conclusions).
BiedermannS@reddit
The big reason for focusing on what happened and why instead of who did it is that who did it is irrelevant to fixing the problem at hand. Focusing on who did it derails the conversation into something non productive and it makes people afraid to report when they mess up. The focus should always be on how to fix the issue in a productive manner.
Who messed up is something that's only relevant when you start noticing it being the same person over and over again and even then you should figure out why it happens over and over again without shaming the person at fault. There's plenty of reasons why people mess up and many times there's room for improvement to make people less likely to mess up. Sometimes people just get unlucky as well.
Of course, sometimes you do have people who aren't fit for a job and make mistakes all the time and then it needs to be addressed properly, but that shouldn't be the first thing to focus on.
Izacus@reddit
That only works if the root cause is not incompetence and/or malice.
Even aviation - the birthplace of blameless postmortems and resulting procedures - will assign blame to pilot error when it's obvious that the pilot worked knowingly and directly against safety and sound judgement.
glotzerhotze@reddit
This is called accountability and if people can ditch that hiding behind processes you should evaluate your company culture.
Izacus@reddit
Yes, blameless postmortems are how people shed accountability. They're one of the accountability sinks (https://aworkinglibrary.com/writing/accountability-sinks) in modern corporations.
knome@reddit
mistakes are something that humans will do. tools should be capable, but building sensible safeguards into them is reasonable. the typo that took down all of S3 (forcing them to cold boot for the first time ever as overload cascades rippled through the system and prevented correcting it in place) resulted in the tool being fixed so that it could not reduce capacity below the amount of S3 that was required to keep the service itself operable.
which is not to say someone can't be incompetent, but that systems should be in place to catch incompetence before it causes real problems.
code should be reviewed, automated tests should catch issues, more than one person should be part of deployment decisions, you can do manual tasks by having one person with the runbook reading and another on the keyboard, checking each other as they go through a process, standard day-to-day commands can produce actions that require sign off before execution.
how much of this you want to put in place is a call the team has to make. if your software depends on no one fucking up, it isn't a matter of if your software will fall over, just how long until the next time it does.
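A rough sketch of two of the safeguards described above: a capacity floor the tool refuses to go below, and sign-off from more than one person before a change runs. Everything in it (`Fleet`, `remove_servers`, `min_servers`, the numbers) is hypothetical and made up for illustration, not how any real tool works.

```python
from dataclasses import dataclass


class UnsafeOperationError(Exception):
    """Raised when a requested change would violate a safety rule."""


@dataclass
class Fleet:
    # Hypothetical model of a pool of servers behind a service.
    active_servers: int
    min_servers: int           # assumed floor needed to keep the service operable
    approvals_needed: int = 2  # assumed number of distinct people who must sign off


def remove_servers(fleet, count, approved_by):
    """Remove `count` servers, refusing anything that breaks a safeguard."""
    if count <= 0:
        raise ValueError("count must be a positive number of servers")

    # Safeguard 1: more than one person has to be part of the decision.
    if len(set(approved_by)) < fleet.approvals_needed:
        raise UnsafeOperationError(
            f"need {fleet.approvals_needed} distinct approvers, "
            f"got {len(set(approved_by))}"
        )

    # Safeguard 2: a typo in `count` cannot take capacity below the floor.
    remaining = fleet.active_servers - count
    if remaining < fleet.min_servers:
        raise UnsafeOperationError(
            f"refusing to go from {fleet.active_servers} to {remaining} servers; "
            f"minimum is {fleet.min_servers}"
        )

    fleet.active_servers = remaining
    return remaining
```

With a fleet of 100 servers and a floor of 80, removing 10 with two approvers goes through, while a fat-fingered request after that is rejected and leaves the fleet untouched. The tool stays capable for normal work, and the mistake surfaces as an error message instead of an outage.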
Izacus@reddit
The point is - no tool, no software, no process will defend you against a malicious actor inside your team. So your postmortem needs to account for that option as well. Otherwise you're not covering all your bases.
Dreadgoat@reddit
Blameless culture is supposed to cut both ways. If you always go to blameless as default, establish that culture very strongly, and always make every effort to make systems robust and un-fuck-up-able as is reasonably possible, what does that entail when someone somehow manages to fuck something up anyway?
The new guy sometimes deletes something important, or finds an unexpected way to push test changes to production. This is valuable and good, as the new guy has inadvertently discovered flaws in the system and is helping the team become more robust in the long term.
If the second new guy comes in and clicks through 17 "are you sure you want to annihilate the planet and fuck your grandma?" prompts and dismisses 5 "this action requires permission from god himself" notifications, that guy gets axed instantly without a second thought.
It's blameless every time up until it can't be blameless, and then it's cause for immediate termination.
BiedermannS@reddit
Sure, but in my experience it's neither malice nor incompetence, that's why I said you shouldn't start there. I also said you should look into it deeper when the issues pile up and it's always the same person.
In aviation I'd expect them to launch a full on investigation into what happened and look into all aspects, because there are lives at risk. I still think you shouldn't start with blaming the person, but work out what happened and if you see the reason was incompetence, then focus on the person.
Also, most software is not aviation. There aren't lives at stake, so it doesn't need to be that strict and you can even accept some incompetence and have the person do training to help them.
Obviously there are cases where the best course of action is to fire someone, but even then the first step should focus on what went wrong in order to fix the problem in a productive manner and then look into the why and see if there's incompetence at play.
rollingForInitiative@reddit
It’s also about preventing future problems, because people who know they’ll be punished for mistakes will just try to hide them, which just causes bigger problems down the line. You want someone who messed up to immediately tell everyone relevant what they did so it can get fixed properly, and perhaps so that the mistake doesn’t turn into something bad at all.
But yeah, if one person keeps making the same mistakes they aren’t learning, and that’s a different problem.
Robodude@reddit
At all the places I've worked we have had a requirement to have code reviews before anything is merged in. This means that if Kevin introduces a disastrous code change, someone else had to have approved it. I may be naive in thinking this approach is standard across our industry. But in these environments, it makes placing the blame very difficult.
diMario@reddit
As a Dutchie, I couldn't agree more. Always look for a solution first before starting to investigate the cause and formulating a strategy to prevent the same problem in the future.
However, also as a Dutchie, when formulating a strategy to prevent the same problem from happening again, you've gotta be realistic and if that involves pointing fingers, then fingers should be pointed.
BiedermannS@reddit
Absolutely. Fix first, work out what happened, take appropriate action to make it less likely or impossible to happen again.
Bayo77@reddit
It's software; if you don't use git processes, then that is your problem. If you do use them, then there are at least 2 people that are responsible for the changes.
There should never be 1 person being able to break something on his own.
key_lime_pie@reddit
You also need to determine why it's the same person, because it still may not be that person's fault. I've been reorged in and out of competency and I've seen the same thing happen to other people.
NeilFraser@reddit
But be careful of the case where Chad is the root of 90% of problems, but he's also the one who does 90% of the production work.
Emergency-Diet9754@reddit
Well I had exactly this scenario come up. New SI came in and started bashing a non prod database with incorrect credentials that locked the service account.
Rather than fix handling of login credentials, management wanted the server to be modified to never lock accounts.
Yup, makes sense given that no account had ever been locked for years leading up to this.
diMario@reddit
Ah. The trick in dealing with clueless management is this: agree with whatever they suggest, promise to apply whatever fix they want, and - this is crucial - add that you have an idea that will make doubly sure that this problem will never happen again, and it will cost almost no extra time.
Make sure to only mention it in the discussion and not ask for permission to implement it.
Then do whatever you feel is necessary to fix the problem, possibly ignoring the solution preferred by management, and report back that the problem is fixed without going into details.
Should discussion arise, you can then point out that (1) your solution works and (2) management implicitly gave you the go ahead to implement it during the original discussion of the problem, where they suggested the thing that is not really a solution.
reivblaze@reddit
The risk with this approach is if (1) is not met. I.e., if you were wrong, then you are fucking up big time.
diMario@reddit
Well, you know what they say ... If you're not part of the solution, then you're part of the problem.
The honourable thing to do in this case would be to admit you fucked up and accept the consequences.
Sadly, few people these days can admit - even to themselves - they did something stupid.
reivblaze@reddit
Yeah, and as always that depends on whether it's even worth the risk for the rewards. Because sometimes the rewards are nonexistent. It's finicky and hard tbh.
CherryLongjump1989@reddit
I gagged a little reading this.
ayayahri@reddit
How do you know who is causing the problems ? Is there someone on the team who is constantly pestering management to complain about other people's performance ? Are you sure you have an okay understanding of the team dynamics ?
You should always be suspicious of those who are eager to assign blame.
Known-Western-1294@reddit
Then it can be rephrased as a HR process issue - why such an incompetent candidate was let through. It can sound a bit passive aggressive tho..
thehustlingengineer@reddit (OP)
I think if someone is making a new mistake every time, it is fine. If someone is making the same mistake repeatedly, then it is a matter of worry
diMario@reddit
Mmm. Someone making a new mistake every time could indicate that they for some reason or other have a different way of looking at things, as opposed to the people on the team who don't make those mistakes.
I mean one is likely to do the wrong thing when reacting to a newly discovered fact, requirement, bug, or quirk, which when working in software happens on a daily basis. There are the team members who deal with these discoveries and fix the problems that arise in a good and permanent way, and then there is Kevin, Chad or Ashleigh who consistently finds a wrong way of reacting to these things.
I'd say that tells us something about Kevin, Chad or Ashleigh.
glotzerhotze@reddit
More so it tells you something about the manager of Kevin, Chad or Ashleigh, who clearly thought it was a good idea to - repeatedly - hand out tasks to people who are not capable of doing them as the business demands in well articulated guidelines.
Spoiler: it was NOT a good idea by said manager and business should talk about that topic, too
diMario@reddit
Unfortunately, because most managers are special, they get special consideration.
glotzerhotze@reddit
A fish rots from the head down
🤷‍♂️
chucker23n@reddit
This is true.
But those are two separate things.
Mixing those things hurts both the team and the project.
glotzerhotze@reddit
This is solid advice.
trippypantsforlife@reddit
Ashleigh reminded me of r/Tragedeigh
frezz@reddit
This is a problem of performance, and should not be handled during a post mortem.
If management is not dealing with that, then you have much bigger problems than post mortems that need solving
Ok-Cantaloupe-9946@reddit
The why it happened would be recruitment process then would it not?
Character_Respect533@reddit
I used to work in a team where a post mortem is fun because we just found a new breaking point in our system and it's time to improve it. Kudos to the EM!
diMario@reddit
Well, yes and no. If someone has a knack for doing unconventional things and thereby exposing subtle ways in which the system is imperfect, yes, by all means, applaud them for it.
If, on the other hand, someone is cranking out code with no regard for error handling, performance, DRY or just plain common sense, that's a problem.
key_lime_pie@reddit
If you want your QA department to reflexively hate you, this is the sentence you want to use. I've improved morale so many times just by asking PMs to say "Why didn't we catch this?" instead of "Why didn't QA catch this?"
In my experience, the overwhelming number of escaped defects have come either because the QA team literally couldn't test the scenario that causes the defect, or were told not to test it either explicitly or implicitly.
At my last job, I was put in charge of the RCA team when it was formed, because I was running QA and there was an expectation within management that QA would be the root cause of the majority of escaped defects (despite me telling them that it wouldn't). After three months, the RCA team was disbanded because the root cause was invariably "management," and you can't really pound a desk and demand that somebody do better when that somebody is you.
shevy-java@reddit
This depends on the coping strategy. I'll give an example.
Many years ago I was working with other co-students in a biotech / microbiology lab (mostly as a training area, so not a "real" lab with paid professionals). The area was a bit convoluted and you had to go all over the place, sometimes also fetching stuff from other floors. One area was the breeding room, kept at 37°C to get the bacteria (or whatever else was growing) to grow faster. The room itself was a bit below 37°C, so only the breeding area was annoyingly hot, and lots of other students were there, going in and out. I was also one of those people who naturally had a higher heart rate, thus generating more heat, even when skinny; though I was no longer skinny back then, to word it nicely. Gist was: it was damn hot and this affected my thinking, which got slower, and working for some hours was also tedious.
Sometimes students forgot to close the lid/chamber and then the temperature dropped off. This can be problematic depending on what you are trying to achieve; e.g. too low a temperature, smaller growth, less material to analyse, lower OD measurements and what not. Tracking was done at one-hour intervals or less, so we ended up going into this place 4x or 6x per hour, for a total of perhaps ... 20x or so (we split up the tasks of course, so not everyone was doing the same; different groups operated differently, and some had to start again due to mistakes). So I was going into the place several times.
Now, another female student was just about to start, but noticed the temperature was off and asked me whether the student before me had forgotten to close it. I wasn't sure how to answer this: for one, I could have made a mistake (I was sort of daydreaming so I didn't pay that close attention to my work); or it could have been the other student (I actually think it was him).
But either way, giving a clear answer, aka putting the blame either on me or on him, wasn't a good strategy, so I went with saying that I wasn't entirely certain, which was kind of evasive. There are many other ways to deal with such a situation (I also was not prepared for that), but to me it simply did not feel right to put blame onto another student even IF that student was at fault for sloppy work. The female student was also not super-happy with my reply and then assumed that I was the one doing the sloppy work, so that was a lose-lose situation. Now I could put the blame on her! But I think the situation was overall not good, since the discussion would point towards accusing someone.
In hindsight I guess a better strategy would have been to first say that I was not sure who was to blame (my head was really dizzy; when you are in or near an oven for hours, you don't think normally in a tedious work situation), and then to explain the problem I had with accusing anyone else (or myself; I didn't want to accept blame for something I didn't do either, so that was not a great situation). An even simpler solution would have been some automatic way to guarantee the temperature is maintained, be it self-closing doors or a beeper on the spot or anything like that.
I've since used that experience to try to find strategies to not put blame on anyone (if possible to avoid), and if not, then to try to come up with alternatives, such as the story of the frog and the princess and what happens to the frog if there is no princess. For some reason people understand stories (analogies) better than the "YOU IDIOT!!! YOU JUST COST US TWO MILLION EUROS!!!".
Enough-Ad-5528@reddit
Amazon was like this for a long time. From 2010 to somewhere around 2019 or 2020 was the peak of the Amazonian culture.
Exceptions did exist of course, given it was such a large company. But mostly it was a blameless culture: we were always encouraged to focus on the right thing to do for the customer, design for the long term, and share learnings from failures and outages openly. Somewhere after that money became expensive, projects stopped getting funded, people were made to feel insecure about their jobs, and metrics started to be manipulated or plain fabricated.
Now it is all about survivorship, backstabbing and team/org politics. I guess that is what happens when times are tough and there is not enough money going around. I am just glad I got to experience the peak for almost a decade.
zodomere@reddit
It is still supposed to be like this. But yeah lots of politics. COEs seemed to be used as punishments rather than learnings.
fragglerock@reddit
At least it was a good place to work in the lead up to it inciting genocide in Myanmar!
Feel good for ya buddy!
chucker23n@reddit
Wasn't that a Facebook/Meta thing rather than AWS?
Awesan@reddit
Crazy to imply that Amazon does not have enough money going around, as it literally reported 18bn (!) in profits last quarter, up 4bn (!) from the same quarter in 2024.
chucker23n@reddit
Sure, but the question is: how much of that ends up with managers who can allocate it to the team?
Enough-Ad-5528@reddit
Yeah, exactly. Until a few years ago, AWS rarely deprecated stuff, and even when they did, they did it with utmost care, extremely long lead times, and generally had much superior alternatives. Now they are just turning off services and asking customers to find something else.
Even among services that are not fully turned off, some are just allowed to keep running existing versions, with a few people from other projects being asked to offer critical-only oncall support. Projects got defunded, new projects and initiatives are hard to get funded, and mistakes are treated more severely. Even though there is more money in the bank, if it is not AI then it is an uphill battle to get something funded.
Nervous-Spite-7701@reddit
bro said not enough money to go around
xSaviorself@reddit
We just had a clusterfuck of a time at my shop due to one person's mistake, and it wasn't intentional. Blameless culture is the only way to properly position a business to improve process and cultivate a positive work environment.
Someone who fucks up probably knows and feels bad, especially when it affects other teams/units in the business. They don't need to be reprimanded, they need to have the resources to bring about better processes. It's on leadership to provide that.
Team fucked up? It's a learning exercise for everyone. Bob fucked up? Now we're looking at Bob with a magnifying glass for no reason.
Full-Spectral@reddit
In a highly complex system, over time, everyone will screw up once in a while. If that system is old and has suffered from the usual ad hoc 'improvement' that most do, even more so because the problems become more and more whack-a-mole.
I made one a couple months back. The product is very complex, highly configurable, and (horrors) in C++ where there are so many ways to screw the pooch that we all are looking so hard for the tricky ways it can happen that a very simple one slipped by me and all of the reviewers.
To be fair it was a bit of an emergency change right at the end of a release, so it had too little time to get banged on and the issue exposed.
Round_Head_6248@reddit
ai slop
who_am_i_to_say_so@reddit
I’ve been seeing these same 4 points being made for ten years, sigh
PersianMG@reddit
Blameless culture works because blaming somebody for an unintentional mistake is a waste of time. It demoralises that person and the rest of the team, and the issue needs to be solved anyway. That wasted time is better spent improving processes etc.
With this being said, sometimes the process is fine and the mistake is a human error: "person not reading docs and ignoring the warnings which led to DB being dropped". In those cases, it's very much productive to focus on the person that caused the issue. Not to blame them but to make sure they learn so it doesn't happen again.
HoratioWobble@reddit
A lot of companies I've worked in that practice blameless cultures (supposedly) just practice a blame culture.
Whereby they all agree it's a team problem, but still ridicule and chastise the person who made the mistake.
syklemil@reddit
There are some other bits from the SRE book that are good to pick up along with this, especially the concept of an error budget.
With blameless PMs it's kinda easy to drift towards building up ever more automated guards, but those also often slow people and teams down. Ultimately you may build a kafkaesque system.
Sometimes what you want is to have that PM, and then conclude that nothing more will be done and write it off on the error budget, because the way to prevent it from reoccurring is too costly relative to the error.
That said, I am generally a fan of "make invalid states unrepresentable", and then linters and policy engines to cover up the cases where we have some existing system that people may inadvertently configure into some invalid state.
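For anyone who hasn't met the error budget idea, here is a rough Python sketch of the "write it off" decision described above. The budget formula (allowed downtime = (1 - SLO) x window) is the standard one; everything else is made up for illustration: the `Incident` fields, the `postmortem_outcome` helper, and the `max_fix_days` threshold are hypothetical, and a real team would weigh this with far more nuance.

```python
from dataclasses import dataclass

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in a 30-day window


def error_budget_minutes(slo: float, window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Downtime the SLO tolerates over the window: (1 - SLO) * window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_minutes


@dataclass
class Incident:
    # All fields are hypothetical estimates a team might jot down in a postmortem.
    name: str
    downtime_minutes: float        # downtime caused per occurrence
    expected_per_window: float     # how often we expect it to recur per window
    fix_cost_engineer_days: float  # rough cost of building a guard against it


def postmortem_outcome(incident: Incident,
                       remaining_budget_minutes: float,
                       max_fix_days: float) -> str:
    """Illustrative heuristic only: write the incident off when its expected
    downtime fits in the remaining budget AND the fix is deemed too costly."""
    expected_downtime = incident.downtime_minutes * incident.expected_per_window
    fits_budget = expected_downtime <= remaining_budget_minutes
    too_costly = incident.fix_cost_engineer_days > max_fix_days
    if fits_budget and too_costly:
        return "write off against error budget"
    return "build a guard"


budget = error_budget_minutes(0.999)  # roughly 43.2 minutes per 30 days
rare_typo = Incident("config typo", downtime_minutes=5,
                     expected_per_window=0.5, fix_cost_engineer_days=20)
print(postmortem_outcome(rare_typo, budget, max_fix_days=5))
```

The point is only that "do nothing" is a legitimate postmortem outcome once the cost of prevention is put next to the budget, not that the decision reduces to two comparisons.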
JoelMahon@reddit
I think the approach at my company is pretty good, all our team members currently make mistakes, we're all human. sometimes they slip past review, which means the reviewers made a mistake as well. we never roast a specific person to the higher ups because we'd all be roasted and none of us want that and it's not productive. we own those mistakes as a team.
in the past we've had notably slow or notably error prone team members and in those cases we privately message our immediate team manager (who is a team member) and let him know, and he tries to correct it, and if correction doesn't work then I guess eventually they'd get fired. it never came to that, as the only person that was close to being fired quit for another job. but we still never roasted him in front of higher ups.