Need thoughts on “firefighter” processes
Posted by MoveInteresting4334@reddit | ExperiencedDevs | 12 comments
Hey all,
I’m working on a mission critical internal app at a huge international bank. Our application is used by about 150,000 bank employees. Every sprint someone is the firefighter, and takes on responsibility for initially handling any Prod incidents, as well as planning and executing deployments.
We recently finished a very rough re-write of the application (app is extremely complex, little to no design documentation, team didn’t have experience with new tech stack, etc.). The difficulties are a long story but suffice to say we are now in production and prod incidents are not a rare thing.
Since our go live, it’s a common thing for the firefighter to be so swamped that it’s all they can do to record and track incidents as they come in, let alone try to triage or even delegate them. Things were falling through the cracks or taking weeks to be addressed.
After my sprint as firefighter, it was so bad that I went to our scrum master and worked out a flow to distribute the workload a bit more and keep a single source of responsibility through the process. We ran this by the teams (devs and analysts) and management. We got buy in across the board after a few small tweaks. It goes something like this:
- Incident comes in. Analysts confirm that an incident ticket has been created, pull any logs they can find, and engage our prod support team.
- If the prod support team can't fix it, a non-critical defect goes into the backlog for future planning. If it's critical, it goes to the firefighter with logs and a description of the issue.
The overall idea being that it should only reach the firefighter dev if it’s a critical issue that the support team can’t handle.
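The agreed flow can be sketched as a simple routing function. This is a hypothetical illustration, not code from the actual system; all names (`Incident`, `Severity`, `route`) are made up for the sketch.

```python
# Hypothetical sketch of the agreed triage flow; names are illustrative.
from dataclasses import dataclass, field
from enum import Enum, auto

class Severity(Enum):
    NON_CRITICAL = auto()
    CRITICAL = auto()

@dataclass
class Incident:
    ticket_id: str
    severity: Severity
    logs: list[str] = field(default_factory=list)
    resolved_by_support: bool = False

def route(incident: Incident) -> str:
    """Return who owns the incident next, per the agreed flow."""
    # Step 1: analysts confirm the ticket, pull logs, engage prod support.
    if incident.resolved_by_support:
        return "closed by prod support"
    # Step 2: support couldn't fix it, so branch on severity.
    if incident.severity is Severity.CRITICAL:
        return "firefighter"  # goes to the dev with logs + description
    return "backlog"          # planned into a future sprint
```

The point of the sketch is that "firefighter" is a terminal branch reached only when both earlier filters fail, which is what makes the dev's load manageable.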
This sprint is our first sprint implementing it and I’m the firefighter. It’s already falling to pieces. In our staff meeting today my manager and I had the following exchange:
Manager: there were some incident emails from users that weren’t responded to this morning. The firefighter needs to watch for those.
Me: with our new process, the analysts should be engaging with the users and pulling me in if necessary.
Manager: management won’t care who did or didn’t respond, only that nobody did. Processes are great, but there are always exceptions. If you see them not responding, you need to.
I dropped it since it was a group meeting and I don’t really disagree with her in principle but I have several issues with this. With this approach, there’s no accountability. If analysts don’t respond, it just falls on devs. There’s also no single responsible party. What happens if the analysts are busy and assume devs will respond, and devs are busy and assume analysts will respond? Wouldn’t it be wiser to work within the process, like forwarding the email to analysts and saying “hey, did you guys see this?”
How do your teams handle this? And any advice for getting teams to embrace accountability and organization when their norm has always been dumpsterfire/all hands on deck?
Equivalent-Score-900@reddit
This is a rough situation. I think you played it right in theory, but I would have gotten managerial buy-in before implementing it. The firefighter role is a way to cover up bigger issues in the process/org.
I was once in a similar role with a large corp where on call was 365, and it destroyed me. 20-40 incidents a month. I talked about process changes with managers, directors, and vps until I was blue in the face and got zero buy in. Did what I could control and got it down to 10 but it was still too much of an invasion of my WLB. Lasted 15 months and had to find something else.
Try to change the process, and do what you can control to manage it without process changes (identify areas of breakage, add tests, get buy-in from your peers).
Good luck!
jl2352@reddit
The process sounds overly complex tbh.
One thing I would say about changing processes: they always go wrong at first. Putting a process in place isn't enough. You need someone to champion it and keep it moving when it starts.
That would include looking at emails coming in and prodding those responsible to do what was agreed.
cheeman15@reddit
There are ways to reduce firefighting, but until you can get on that track in a sane way, you need at least two people firefighting. Never leave your firefighters alone, because the fire can spread quickly, both technically and among people.
PragmaticBoredom@reddit
There are a lot of confusing and concerning things about this situation. Is the “firefighter” supposed to be a single person who handles customer support, on-call, triage, and even investigation of problems? This feels like the company has some big problems but everyone just wants to put their heads down and do their own job. So they invented a new rotating pseudo-job and assigned all of the unsavory problems to that pseudo-job.
MoveInteresting4334@reddit (OP)
Yeah, you’re pretty much spot on. The only caveat is that the Firefighter is encouraged to delegate to other devs as needed (like the Firefighter saying “Hey Paul, can you triage this defect that just came in for me?”). But even taking in issues, recording them, and finding someone who can take a look at them is a large task by itself if there are a ton of issues rolling in.
PragmaticBoredom@reddit
Is there any customer support type team at all?
Having devs be front line support for 150,000 people is completely unreasonable, not to mention a waste of expensive dev resources.
BonnetSlurps@reddit
We used to have two levels of support at my previous company, and it was wonderful:
PM would then sort the bugs. Urgent tasks would go into the sprint right away; thanks to the reproduction steps, anyone on the team could take them. Non-critical issues would go into the next sprint.
MoveInteresting4334@reddit (OP)
This is exactly what I want to emulate, but they treat the suggestion like I’m saying the critical issue should go through seven layers of bureaucracy and nine approvals.
I just want a single point of responsibility and a logical flow.
BonnetSlurps@reddit
I really miss that company. QA is still their secret weapon.
And it was the best QA department I ever worked with. It was mostly CS students working part-time, but with a 1:1 QA:DEV ratio. And the quality of the work was 10x better than any other QA department I ever worked with.
chills716@reddit
Not a ding on you at all, but especially for something financial and mission critical, it should have been tested so thoroughly that all issues were discovered beforehand. A long-time friend works for one of the largest financial services companies in the world, and they deploy to production twice a year due to the critical nature and the absolute requirement that it works exactly as it should.
Like I said, not a ding on you, just how this has been framed is very concerning.
MoveInteresting4334@reddit (OP)
Couldn’t agree more. I spent nine months begging for a regression suite at LEAST. They finally hired an overseas contractor team to build it out.
That was a year ago. Still never seen it.
chills716@reddit
I predominantly work in healthcare, and the number of companies that are lackadaisical about security until there is a breach is astonishing. Both issues are a pay-now/pay-more-later problem.