24/7 support at scale sounds great until your team hits its limits
Posted by Such_Rhubarb8095@reddit | sysadmin | 35 comments
Traffic spikes at 3am because teams across Asia are waking up, after-hours tickets start coming in from Europe, and a small team of 12 is expected to keep everything stable without things breaking down. Response times look fine at first, but after a day or two they start slipping because people simply can't sustain constant coverage without breaks.
Leadership talks a lot about decoupling support capacity from headcount, usually followed by some push toward automation or AI that is supposed to absorb the load. In reality it often just shifts the work somewhere else, or adds another layer of things to monitor when something goes wrong.
We've tried scaling with contractors during peaks, but they don't stay consistent when demand is actually high. Self-service tools help in theory, but users still escalate everything when it gets even slightly unclear. Dashboards always promise smooth infinite scale, but the real world still needs people to step in when things break.
How are teams actually handling 24/7 global support at scale without burning out or everything collapsing behind the scenes?
Centimane@reddit
... What?! No it doesn't. It's a famously hard problem.
OregonTechHead@reddit
Not really. You just hire people to work all hours in shifts. It's not hard at all, and tons of companies around the world have been doing it forever.
It's only "hard" when you're not adequately staffing
Centimane@reddit
That's part of what makes it hard.
It's true that it's a problem that can be solved by throwing more money at it, as with all problems. But the hard part is striking a balance between "always having coverage" and "optimizing the staffing" (e.g. you have enough for peaks, but don't normally have a ton of people with nothing to do).
OregonTechHead@reddit
None of this is hard though.
It's just normal staffing for any business across all departments.
In fact, I'd argue it's incredibly easy in the IT world because we typically (and should) have concrete metrics of how many tickets are coming in, how many people we're supporting, average ticket closure time, average response time, etc.
It's very easy to quantify how many people you need.
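As a rough sketch of that kind of quantification (not a full queueing model like Erlang C, and with made-up numbers), you can turn ticket volume and average handle time into a per-shift headcount. The function name `agents_needed` and the 0.7 default for `target_occupancy` are assumptions for illustration only:

```python
import math

def agents_needed(tickets_per_hour: float,
                  avg_handle_minutes: float,
                  target_occupancy: float = 0.7,  # assumed default, tune to taste
                  min_agents: int = 1) -> int:
    """Estimate how many agents must be on shift at once.

    Simple workload ratio: hours of ticket work arriving per hour,
    divided by the fraction of time agents should spend on tickets.
    It ignores arrival bursts, so treat the result as a floor.
    """
    if not 0 < target_occupancy <= 1:
        raise ValueError("target_occupancy must be in (0, 1]")
    workload_hours = tickets_per_hour * avg_handle_minutes / 60.0
    return max(min_agents, math.ceil(workload_hours / target_occupancy))

# Hypothetical shift: 6 tickets/hour at 15 minutes each, agents busy half the time
print(agents_needed(6, 15, target_occupancy=0.5))
```

Multiply the per-shift number by the shifts you run and add cover for vacation and sick days to get to total headcount.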
melissaleidygarcia@reddit
most teams survive with strict on call rotations and very tight escalation rules not constant coverage
OregonTechHead@reddit
What OP has isn't on call. OP has users in different time zones that require like time staff.
Medium_Banana4074@reddit
But "on call" is when you get an automatic callout when some server crashes and you have to get it online again.
If you have to solve user tickets, it's a night shift, not on-call
MitochondrianHouse@reddit
That's what I'm confused about. You have 12 people. Put them on 3 different shifts.
tdhuck@reddit
That's even worse in the OP's scenario. If they are having trouble with 12, imagine more teams but fewer members per team.
The issue is that support costs money and companies don't want to spend money on support.
Every company could have awesome support if they wanted to pay for it. I'd work in Help Desk my entire career if the compensation was good enough, but it is not. Companies want to get away with paying the least amount they can for support, that's the bottom line.
Zenkin@reddit
You probably need two people per shift to guarantee calls aren't getting missed. 168 hours in a week with 8 hour shifts gives you 21 total shifts, double that to 42 to account for full coverage. 12 people can cover 60 shifts, so that leaves 18 shifts for additional workers to handle peak hours. Two extra people covering the 9x5 period (or whenever things are busiest) is 10 more shifts. That's 8 shifts or 1.5 "extra" people of overage per week.
Can that technically be done? Kinda. Two people can't take vacation or get sick at the same time, so.... that's a problem. If these people were only doing escalations, it could be viable, but if they're monitoring and doing all the work.... I don't see how it can function long term. We have a "triage" team for 24/7 coverage, and they have more than 12 people, and our business isn't even that big.
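The shift arithmetic above can be written out as a small sketch. It assumes everyone works five 8-hour shifts a week; the function name and parameters are hypothetical, and the defaults just mirror the numbers in this comment:

```python
HOURS_PER_WEEK = 7 * 24      # 168
SHIFT_HOURS = 8
SHIFTS_PER_PERSON = 5        # assumption: a 40-hour week per person

def coverage_budget(team_size: int,
                    people_per_shift: int = 2,
                    peak_extra_people: int = 2,
                    peak_days: int = 5) -> dict:
    """Count weekly shifts needed vs. shifts the team can actually work."""
    slots = HOURS_PER_WEEK // SHIFT_HOURS          # 21 shift slots per week
    base = slots * people_per_shift                # 42 shifts for baseline coverage
    capacity = team_size * SHIFTS_PER_PERSON       # 60 shifts for 12 people
    peak = peak_extra_people * peak_days           # 10 shifts for the busy 9x5 window
    spare = capacity - base - peak                 # what is left for absences
    return {
        "base_shifts": base,
        "capacity_shifts": capacity,
        "peak_shifts": peak,
        "spare_shifts": spare,
        "spare_people": spare / SHIFTS_PER_PERSON,
    }

print(coverage_budget(12))
```

With 12 people this leaves 8 spare shifts, i.e. 8 / 5 = 1.6 people, which is the roughly 1.5 "extra" people mentioned above. That is the entire buffer for vacation, sickness, and training.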
JamesRustl3r@reddit
Even if there was no overlap between shifts, that's 120 of the 168 hours in a week.
BoysenberryDue3637@reddit
This is a staffing issue. You have follow-the-sun staffing, so you require follow-the-sun support for that. I had people in the US and AU to support the company. If upper management believes a single time zone can support people worldwide, they are delusional.
OregonTechHead@reddit
Why?
Just staff those hours. You may pay more for 3rd shift workers, but it's absolutely possible and not even an issue.
OregonTechHead@reddit
I don't understand why this is a problem? Your staff scheduled to work those hours should be able to handle those issues.
Or do you not have 24/7 scheduled staff?
Jazzlike-Vacation230@reddit
I never understood it, on call outside of healthcare is sooooooooooooo weird dude. It's cheaper to hire a dedicated offshore team that's awake when you're sleeping
OregonTechHead@reddit
Only if you don't take into account the liaison and the onshore management of that team.
RobbyBurgers@reddit
12 people? Lmao
Try doing it with a person rotation and then talk to me.
spamuser555@reddit
No need to be rude
Amanpushpraj54@reddit
Most places are held together by tired admins and luck.
EnoughGrade1906@reddit
Even if each ticket is quick, like 10 to 15 mins, it adds up quick when they flood in. We had that issue last year, and after a couple days we were so behind that customers were pissed and adding more follow ups. Makes it hard to fix things right.
TechnologyMatch@reddit
that’s the bit people underestimate. one 10-minute ticket sounds harmless… until you’ve got 40 of them breeding overnight like gremlins. then half the day disappears into follow-ups and apology emails instead of actually fixing the problem. queue management matters more than people think
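To make the "adds up quick" point concrete, here is a minimal sketch of how a backlog builds when tickets arrive faster than a shift can close them. The function name and all the numbers are hypothetical, and it ignores the extra follow-up tickets that an angry queue generates:

```python
def backlog_over_time(arrivals_per_hour, agents, handle_minutes, start_backlog=0.0):
    """Return the open-ticket backlog at the end of each hour.

    arrivals_per_hour: list of ticket counts, one entry per hour.
    Capacity is how many tickets the shift can close in an hour.
    """
    capacity = agents * 60 / handle_minutes
    backlog = float(start_backlog)
    history = []
    for arrivals in arrivals_per_hour:
        # Backlog grows by the shortfall and never drops below zero
        backlog = max(0.0, backlog + arrivals - capacity)
        history.append(backlog)
    return history

# One agent, 10-minute tickets (6/hour capacity), two busy hours then a lull
print(backlog_over_time([10, 10, 2, 2], agents=1, handle_minutes=10))
```

In this toy run, two hours of a flood take two quiet hours to dig out of, and that is before anyone sends a follow-up.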
kerosene31@reddit
Being primary support for people in different time zones is not what on call is for. Those are just work shifts. We're on call for major off hours issues, not supporting people during their work day on the other side of the globe.
You need people to cover 3 different 8 hour shifts. Not "on call". What your team is being asked to do is work 24/7. That's insane.
sryan2k1@reddit
You follow the sun and have support in all regions. Otherwise after hours support is true emergencies only and if you're having more than one a month you need more "on shift" workers.
MalletNGrease@reddit
It doesn't sound like you're actually doing 24/7 support. Double your headcount and start shifts.
AlexisFR@reddit
Local IT support teams are the only proper way.
Kaligraphic@reddit
Human interaction scales linearly with the number of humans you need to interact with. This is inherent in the problem domain.
Solution: change the problem domain. Limit coverage to system availability - support costs scale with systems, not users. Separate support team from end user interaction. Catch: need separate resources to address support needs.
Solution: limit acceptance of support requests. Rate-limit request creation or limit to only accept support requests when capacity is available. Support requests will scale with support capacity, not supported population. Catch: needed support will go unprovided, leading to business losses and shadow IT.
Solution: decouple reporting from reality. Put a chatbot in front of actual support and pretend it is adding value. Train it to fob people off and count responses as resolutions. Use AI to generate marketing materials claiming success, then spin out chatbot as an AI startup to collect sweet VC cash. Catch: really just the first two solutions in a hat made of lies. Needed support goes unprovided. Need to time IPO for correct point in hype cycle. Need to time exit strategy for optimal cash extraction. Need to hire accountant for all the money you'd make.
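For the second solution (only accepting requests when capacity is available), a minimal sketch could look like the following. `IntakeGate` and `max_open` are made-up names, not any real ticketing system's API, and the `rejected` counter is there to make the catch visible:

```python
from dataclasses import dataclass, field

@dataclass
class IntakeGate:
    """Accept a new request only while open tickets are under a cap."""
    max_open: int                                   # hypothetical capacity limit
    open_tickets: set = field(default_factory=set)
    rejected: int = 0                               # support that went unprovided

    def submit(self, ticket_id: str) -> bool:
        if len(self.open_tickets) >= self.max_open:
            self.rejected += 1
            return False
        self.open_tickets.add(ticket_id)
        return True

    def resolve(self, ticket_id: str) -> None:
        # Closing a ticket frees capacity for the next request
        self.open_tickets.discard(ticket_id)

gate = IntakeGate(max_open=2)
print(gate.submit("T-1"), gate.submit("T-2"), gate.submit("T-3"))
print(gate.rejected)
```

Load on the team now tracks `max_open` instead of user count, and `rejected` is the number of people who went off to build shadow IT.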
chickibumbum_byomde@reddit
This will lead to more burnout imo. You can't scale 24/7 support just by handling more tickets, that's what burns people out.
A good configuration should use automation and tune alerts as much as possible, so that only the work that actually needs a human reaches one at the end. Fix or suppress repeat issues, stop treating everything as urgent, and automate only the obvious, repeatable stuff. The real shift is deciding what actually needs a human right now. If everything is treated as critical, the system will eventually collapse.
Long term particularly, it’s not about better coverage, it’s about less work hitting the team in the first place.
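A minimal sketch of that triage idea, assuming alerts are plain dicts with a `key`, a `severity`, and an optional `runbook` (all hypothetical field names, not from any specific monitoring tool):

```python
from collections import Counter

def triage(alerts, urgent_severities=frozenset({"critical"})):
    """Split alerts into what needs a human now and what does not."""
    seen = Counter()
    buckets = {"human_now": [], "automate": [], "later": [], "suppressed": []}
    for alert in alerts:
        seen[alert["key"]] += 1
        if seen[alert["key"]] > 1:
            buckets["suppressed"].append(alert)   # repeat of an alert already handled
        elif alert["severity"] in urgent_severities:
            buckets["human_now"].append(alert)    # the only bucket that wakes someone up
        elif alert.get("runbook"):
            buckets["automate"].append(alert)     # obvious, repeatable fix exists
        else:
            buckets["later"].append(alert)        # can wait for a normal shift
    return buckets

alerts = [
    {"key": "db-down", "severity": "critical"},
    {"key": "disk-80", "severity": "warning", "runbook": "cleanup_tmp"},
    {"key": "disk-80", "severity": "warning", "runbook": "cleanup_tmp"},
    {"key": "cert-30d", "severity": "info"},
]
print({name: len(items) for name, items in triage(alerts).items()})
```

The design choice that matters is the order of the checks: repeats are dropped first, so a noisy alert cannot page someone twice, and only one bucket ever reaches a person immediately.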
blondasek1993@reddit
More. People. Small teams focused on a specific region, with a leader who understands each "pod". Basic automation/self service for the most common tickets (MFA reset, etc.). Put one lazy but smart person in each pod, let them work and give them a bit of leeway, then observe them and incorporate what they do. Usually they can improve the processes a lot and cut the flood of common tickets by 20-30%.
The best teams I worked in or managed were the ones where we had about 45-55% occupancy time across the week. It seems low, but the team was stable, we had capacity to react on demand, and there was no employee rotation. On top of that, any new things were better absorbed and there was space for improvements.
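If you want to check where your own team sits against that 45-55% band, here is a minimal sketch. It assumes occupancy means time spent on tickets divided by total staffed time, which is the usual contact-center definition but may not be exactly what is meant above; the names and numbers are hypothetical:

```python
def occupancy(tickets_handled: int, avg_handle_minutes: float,
              agents: int, shift_hours: float) -> float:
    """Fraction of staffed time that went into handling tickets."""
    staffed_minutes = agents * shift_hours * 60
    if staffed_minutes <= 0:
        raise ValueError("need at least some staffed time")
    return tickets_handled * avg_handle_minutes / staffed_minutes

def in_target_band(value: float, low: float = 0.45, high: float = 0.55) -> bool:
    """True if occupancy falls inside the band quoted above."""
    return low <= value <= high

# Hypothetical shift: 4 agents, 8 hours, 64 tickets at 15 minutes each
occ = occupancy(64, 15, agents=4, shift_hours=8)
print(occ, in_target_band(occ))
```

Anything well above the band means there is no slack left for spikes, new work, or improving the process.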
teqqra1@reddit
Agree with that percent!
bukkithedd@reddit
24/7 support with global coverage done by 12 people? I'm afraid I must ask you to tell your management to stop drinking anti-fouling paint meant for ships on the daily, as it's CLEARLY affecting their ability to think.
It comes down to what level of SLA that's required. If it's handling all the everyday bullshit tickets (mousepointer not symmetrical, can't find the document I saved, my outlook looks weird etc) for literal thousands of employees worldwide, then you're off into the deep end straight off the pier with an anchor tied to your feet even before you start factoring in timezones, number of heads that are actually on call etc.
Constant global coverage for regular helpdesk-work with that few heads isn't possible, period. If it's for very specific things like networks/firewalls and datacenter-ops etc, it changes to a maybe with the caveat of it even THEN being a tall order depending on loads. And it's one that's hard to automate your way out of.
Cooleb09@reddit
TBH that head count could totally work, depending on org size and support requirements. Global support doesn't mean thousands of users, there are a few orgs with smaller headcounts that happen to have bodies all over.
12 FTEs spaced out is 3 drones & a supervisor on each of 3 8 hour shifts 'following the sun'. Depending on load on the super, either one of them (probably based on time-zone favoritism) gets to be manager, or get an extra body for that. If your average ticket volume can be handled by n-1 team members you're set
Now if your ticket queues overrun that team size... yeah need more FTEs, or less support intense business services.
Timely_Aside_2383@reddit
I remember our team getting hit hard during product launches, tickets doubling and us just in reactive mode the whole time. Response times tanked and it snowballed bad.
RansomStark78@reddit
Consider warm up teams
RansomStark78@reddit
Started with 25 now at 115
Lol
Mgt is stupid. New hires are just bodies, i lol in my office
I give advice, but ownership is C-suite (paycheck).