Dealing with a Team with primitive Infra that seems fine with it. Cultural Mismatch?
Posted by PressureHumble3604@reddit | ExperiencedDevs
Some months ago for various reasons I joined this team, quite prestigious in the big company and with well above average engineers.
They are tackling a complex domain and they have been doing so for years with a microservice architecture.
That's fine.
Until I discovered they have very primitive infra and the microservices architecture is bloated and inefficient (with horrible horizontal scaling)
Some examples:
- Almost no orchestration (no Kubernetes or similar)
- Extremely simple and hardcoded load balancing
- No tracing, no proper debugging other than console.log and praying to the machine gods.
- Barely usable testing environments (no debugging there)
- No service discovery, if there is any I have yet to "discover" it
- Very limited metrics, it's hard to set up new ones and they are not precise
- No tool to manage logs
Now the reason why the system is in this state is that some people many years ago fucked up, and management doesn't seem to care that the system has frequent outages and the engineers spend 90% of the time firefighting.
At the same time there are so many small things that at the low level we can do to improve the day to day life, things that should have been done years ago.
The problem is that personal initiative is frowned upon, at least partially. When it isn't, there is no guidance, so it's just people working on their own without coordinating.
While at the lower level we discuss the issues frequently in an informal and inefficient way (yes, the department communication is crap at every level), not everyone seems to view the situation as being as dramatic as it is.
The daily life of an employee consists mostly of ssh-ing into multiple production machines, grepping several logs and going down a rabbit hole to investigate the daily outage. If we are lucky, we can run some horrible shell scripts that may help us investigate or hotfix the issue.
Is this normal?
Because the people that work with me are quite smart and definitely the best ones I have worked with, but they seem not to have their priorities straight, nor can they communicate properly.
morosis1982@reddit
If the team has been around for a while it's possible they started with a monolith and migrated to microservices and just didn't know what they didn't know, didn't get to update other aspects and now don't care much.
I've mostly worked in r&d, building new systems, where you need to ship features, so have a strong history of minimising the time I need to spend on the day to day support so that I can build.
If they don't have a significant backlog of features, then the day to day of just investigating and fixing issues could be 'enough' for them.
Honestly, the thing we do in my team is just ask ourselves each sprint what is causing us pain? Create a story to tackle it and just get it done. If it's a big change then we break it down so that it doesn't significantly affect capacity.
Just start somewhere. If parsing logs is the issue, see if you can introduce a log ingest, maybe there's capacity to get something open source installed and ingesting the logs so that you can query in one place. Especially in distributed services, add a tracking token to events so that an event can easily be traced across the system.
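A minimal sketch of the tracking token idea, assuming Python services that use the standard `logging` module. The field name `tracking_token`, the service names and the event shape are all made up for illustration; the point is only that the first service mints a token, every log line carries it, and it is forwarded with the event.

```python
import contextvars
import io
import logging
import uuid

# Holds the tracking token for the event currently being handled.
tracking_token = contextvars.ContextVar("tracking_token", default="-")


class TokenFilter(logging.Filter):
    """Copy the current tracking token onto every log record."""

    def filter(self, record):
        record.token = tracking_token.get()
        return True


def make_logger(name, stream):
    """Build a logger whose lines always include the current token."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(name)s token=%(token)s %(message)s"))
    handler.addFilter(TokenFilter())
    logger.addHandler(handler)
    return logger


def handle_event(logger, event):
    """Log the event under its token and return the event to forward."""
    # Reuse the caller's token if present, otherwise mint one at the edge.
    token = event.get("tracking_token") or uuid.uuid4().hex
    reset = tracking_token.set(token)
    try:
        logger.info("handling %s", event["type"])
        # Pass the same token along to the next service.
        return {**event, "tracking_token": token}
    finally:
        tracking_token.reset(reset)


if __name__ == "__main__":
    out = io.StringIO()
    orders = make_logger("orders", out)    # hypothetical service
    billing = make_logger("billing", out)  # hypothetical service
    forwarded = handle_event(orders, {"type": "order_created"})
    handle_event(billing, forwarded)
    print(out.getvalue(), end="")
```

Once every service logs the token, a single grep for that token across machines shows the whole path of one event, even before any log aggregation exists.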
If you want to get the team on board, start by asking what they like doing the least and plan to solve that.
PressureHumble3604@reddit (OP)
no sprint review is being done, any proposal for a formal meeting is denied.
Day to day is fine as long as you are not constantly firefighting and damaging business by having outages every other day.
morosis1982@reddit
How does work make it onto your backlog and get prioritised?
That's the forum to propose adding something that improves your flow. Then you can discuss it as a team async if communication is good.
Given the reticence to have a formal catch up about it though, might be a bit of an uphill battle.
Usually I'll improve something for myself first, then share it and when the team sees how much easier I'm having it they'll usually want in.
PressureHumble3604@reddit (OP)
Usually it’s an urgent fix for the daily outage or an epic with barely any description, about something I have never heard of before, that gets assigned to me by management because the business needs it.
morosis1982@reddit
Mate sounds like you need AI.
Grab the epic, run it through Claude with a verbose filter, then sell it back to management and get them to sign off.
Seriously though, create a story to address something the team hates, get some support for it and sell it to management as productivity gain. Rinse, repeat.
Bonus points if you can tie it to an actual metric, because management likes pretty graphs.
PressureHumble3604@reddit (OP)
AI is not good enough to deal with such context.
The performance degradation is so serious that I can’t sell any serious effort to management easily because I am not supposed to deviate from business goals. Minor effort is ok but the team doesn’t agree and even when they agree they are too busy to review.
morosis1982@reddit
That was somewhat tongue in cheek, if they won't give you good epics.
At the end of the day, you either try to make small improvements or move on. I've been using copilot to build support scripts that help me automate some of the manual parts of the job when dealing with inputs from other teams - I haven't yet got a full automation flow, but in a lot of cases I've trained them to give me input in a certain format that I've written parsers for, so I can pretty much eyeball it, drop it into a bucket or API and away it goes and fixes shit.
PressureHumble3604@reddit (OP)
doing similar but it’s still not optimal
Lothy_@reddit
Sounds like you have all the answers regardless of what they think their problems are.
Have you tried engaging in good faith with their problems, as they see them, without your preconceived solutions being what you’re trying to drive them to?
Whatever they’re doing is seemingly working well enough as far as they’re concerned.
PressureHumble3604@reddit (OP)
daily outages with business impact is not in my definition of working well
Lothy_@reddit
Okay. But are you meeting them where they stand, starting with their problems, and with an open mind about solutions?
Or are you just looking to nudge them towards the greatest hits from your previous job?
PressureHumble3604@reddit (OP)
which hits from previous job? what are you talking about, i will not engage with people commenting in bad faith.
Minimum-Reward3264@reddit
Are you a dev or devops?
PressureHumble3604@reddit (OP)
dev, sadly doing too much devops because we practically don’t have them.
shifty_lifty_doodah@reddit
Yep seems normal. It’s a business. If the software makes money and helps customers more than it hurts, it’s working.
It’s very hard to build a business around software customers will actually pay for. It’s normal for things to be a mess. That’s kinda what they pay you to deal with
PressureHumble3604@reddit (OP)
I absolutely think your viewpoint is the worst in the industry, sadly it’s quite common though. In my career I have yet to see it not turning into a disaster and a huge loss of revenue or worse.
vismbr1@reddit
Developers tend to be alarmist. I’ve been there myself. You look at a system and wonder how it works at all. It feels like it’s only a matter of time before disaster hits.
But in reality, messy code, bad design or bad infrastructure is usually fine. We make it work. Big rewrites and changes often carry a higher risk of revenue loss.
PressureHumble3604@reddit (OP)
the developers that tend to be alarmist are the ones who spend hours discussing programming trivia without any serious impact on team or business performance. There are developers that are not like that and when they sound the alarm they are usually right.
Far-Policy5814@reddit
Big rewrites end up becoming giant messes themselves with half the features in the original system
Far-Policy5814@reddit
His viewpoint is not the worst. It's the most pragmatic and grounded in reality.
shifty_lifty_doodah@reddit
Don’t get me wrong I do care, but EVERY business I’ve worked for is a mess in some way. Working with other people’s systems is hard. That’s the reality
LittleLordFuckleroy1@reddit
The issue isn’t primitive infra, it’s the culture around not being able to propose and argue the benefits of implementing improvements.
Who is the most senior engineering staff member responsible for this product? When you discuss this with them, what do they say?
Have you been able to write up a proposal breaking down how much time is spent on debugging as compared to how much time it would take to implement some improvement (log management or whatever) and what benefit it would have?
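The arithmetic behind such a proposal can be kept very simple. A rough sketch follows; every input is a placeholder to be replaced with the team's own measurements, and only the 90% firefighting share comes from the original post.

```python
def break_even_weeks(engineers, hours_per_week, firefighting_share,
                     expected_reduction, implementation_hours):
    """Weeks until the hours saved repay the hours invested."""
    # Hours the whole team currently spends firefighting each week.
    firefighting_hours = engineers * hours_per_week * firefighting_share
    # Hours per week the improvement is expected to give back.
    saved_per_week = firefighting_hours * expected_reduction
    if saved_per_week <= 0:
        raise ValueError("the improvement must save some time")
    return implementation_hours / saved_per_week


if __name__ == "__main__":
    # Hypothetical inputs: 6 engineers, 40 h weeks, 90% firefighting,
    # a tool expected to cut firefighting by 10%, 80 h to build it.
    weeks = break_even_weeks(6, 40, 0.9, 0.1, 80)
    print(f"pays for itself in about {weeks:.1f} weeks")
```

The expected reduction is the number people will challenge, so it helps to back it with a week or two of logged firefighting time rather than a guess.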
Sometimes there are valid business reasons for maintaining a status quo that’s less than efficient. If there’s not, and there’s no avenue to propose solutions, then yes you might want to consider moving on.
single_plum_floating@reddit
Well if you really want to you can work with your managers and find a specific niche and specific point to slowly and carefully unravel this mess and develop a nice system after many evenings.
You will be rewarded with a 5% wage increase and your entire team hating you because now everyone expects twice the work for the same money.
If you want to fix it. Then get a new job and come back as a consultant.
chrisrrawr@reddit
tinfoil hat, this kind of thing is and always has been part job protection and part alibi generator to keep things casual.
teams that CAN deliver more are expected to.
hibikir_40k@reddit
A lot of the things you mention are not always needed: You can deploy gigantic things without kubernetes or anything like that. Service discovery might be completely unnecessary if dns is set up reasonably.
Now, some observability is almost always nice, because it's important to know if the system is down: that's the place to start. But you are only going to get buy-in there if a lot of time is being spent on tasks that would be made trivial by observability. But if it's a mess, getting a daemon sitting on each instance, and just piping the logs elsewhere is not necessarily a huge lift, and then you can get wins, solely through improvements in outage resolution. Hell, before that, I bet you could script a "ssh to 20 machines at once and dig through the logs all at once" personal tooling. From there to "how about we actually colocate logs and host Kibana?" or something like that.
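As a rough sketch of the "ssh to 20 machines at once" personal tooling: the script below assumes key-based ssh access and a POSIX shell with `grep` on each host. The host names, log path and function names are hypothetical; `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(host, pattern, log_path):
    """Build the ssh invocation that greps one log file on one host."""
    # Quote for the remote shell so the pattern and path arrive intact.
    remote = f"grep -H -e {shlex.quote(pattern)} {shlex.quote(log_path)}"
    # BatchMode avoids hanging on a password prompt; ConnectTimeout bounds dead hosts.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, remote]


def run_ssh(cmd):
    """Run one command and return (exit code, stdout, stderr)."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    except subprocess.TimeoutExpired:
        return 124, "", "timed out"
    return done.returncode, done.stdout, done.stderr


def grep_hosts(hosts, pattern, log_path, runner=run_ssh, workers=20):
    """Run the same grep on every host in parallel; return {host: [lines]}."""
    def one(host):
        code, out, err = runner(build_command(host, pattern, log_path))
        # grep exits 1 when nothing matches; anything higher is a real failure.
        if code > 1:
            return host, [f"ERROR: {err.strip()}"]
        return host, out.splitlines()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(one, hosts))


if __name__ == "__main__":
    # Hypothetical hosts and log path; replace with real ones.
    hosts = [f"app{n:02d}.example.internal" for n in range(1, 21)]
    for host, lines in grep_hosts(hosts, "OutOfMemory", "/var/log/app.log").items():
        for line in lines:
            print(f"{host}: {line}")
```

The `runner` parameter exists only so the fan-out logic can be tried without touching real machines. Because the path is quoted, shell globs in `log_path` are not expanded on the remote side.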
Far-Policy5814@reddit
Yes, no need for k8s and all that fancy service discovery stuff. Especially if you host on-prem where server resources are plentiful.
EatDirty@reddit
Sounds like my previous company lol
FreeWilly1337@reddit
What does that cost to accomplish? What is the opportunity cost to accomplish it? What is the actual benefit? Are you just creating new issues to deal with?
ralf3001@reddit
the company is trying to stay lean to get acquired. watch out for tech debt and security issues. i’ve been in this situation and had to manage this company after being acquired. it was a shit show.
Dyledion@reddit
Don't go big. Don't do the big Kube. Start with the scripts. What exists, what still isn't a single invocation? Automate log pulling. Make your changes and improvements incrementally. You're trying to boil a frog, not pull a tooth.
gjionergqwebrlkbjg@reddit
It makes little sense to invest in something this crap. At a large company you will have standard solutions (in 5 different flavors) for log aggregation, you have platforms to run this software on, you will spend a ton of time getting marginal benefits from the crap instead of investing limited time you have to actually solve the problem.
Dyledion@reddit
Conversely, I would not want to spend a single drop of political goodwill on this crap. It's astonishingly easy to streamline things with judicious use of scripting. And, if you're careful and thoughtful about how you do it, a good bash script can transition into a painless dockerfile implementation, into a proper dev environment, into better CI, into better tests.
Do the small things that build towards the necessary things. Save your ranting and manager-bothering for stuff that steers the ship dramatically, like an actually critical refactor.
PressureHumble3604@reddit (OP)
We are talking about a system that is probably two orders of magnitude bigger than what you think
Dyledion@reddit
I've worked with systems where the main component out of a dozen was a million lines of business logic and million lines of tests. I know big.
PressureHumble3604@reddit (OP)
sure
Dyledion@reddit
Okay. You can keep vagueposting. Whatever dude. What I worked on is important, because it was an enormous, almost incomprehensible dumpster fire that was a hundred times better by the time I left, because of small, incremental changes. Leading by action around and between all the other responsibilities I had to make a difference that was unpopular, and forced the company culture to shift. I've done what you're trying to do.
You can continue to weep and wring your hands. I'm not going to stop you any more.
PressureHumble3604@reddit (OP)
let’s take for example log aggregation.
I will not be assigned time to do it. I will not be allowed to use third-party tools.
Best I can do is to spend some hours on a half-assed attempt that may never make it into prod. The tooling we have is a graveyard of half-assed attempts at improving it by ex-employees with limited time and resources, and sometimes bad ideas that clearly were never discussed because the others/management never saw the issue.
unduly-noted@reddit
Do you guys track work at all? You might get buy-in if you present how much time is spent firefighting. Like if engineers are spending 90% of their day logging into machines and grepping logs, you can document that along with a few possible solutions. Like a few log management tools or something. And for each option, give a list of pros and cons and how/why it’s better than the current approach.
I think if you have a well thought out proposal, with concrete steps and clearly articulate why it’s better and how day-to-day will improve, you might get buy-in.
Definitely need to start small though.
PressureHumble3604@reddit (OP)
unfortunately not, I proposed it as well but no one cared, from the ICs to two levels of management up.
_hephaestus@reddit
This is pretty common. Just think about it, if you could do it easily it’d be done, making it happen at scale requires a lot of core culture/process change that would still be disruptive even if you get it right the first time.
I’m not sure I view the situation as dramatic as you’re suggesting. I mean yes, 90% firefighting is awful, bad testing env needs fixing, and observability should be prioritized but am I supposed to be enraged by ssh-ing into machines and grepping logs? It’s not SotA and there’s better ways of fixing things, but if you’re constantly firefighting bugs in releases the fact that prod isn’t locked down seems like a first world problem.
Don’t boil the ocean. Find a low friction way to set up observability, get buy in on things everyone agrees are a problem (i.e. services going down and not the infrastructure being primitive), make small changes, then argue for bigger ones.
PressureHumble3604@reddit (OP)
Some stuff can easily be done and would provide a reasonable improvement in the day to day life. It’s not being done. People don’t even see the issue. That’s the main problem.
kondorb@reddit
OK, you got at least two spot on answers already. I’ll just add a little bit of perspective.
Your examples sound not that far from our current infra and we are totally fine with it. We don’t need fancy log aggregation, tracing, complex metrics, complicated load balancers, etc. It’s as simple as it gets by design because we are low on hands and we don’t want to maintain all that complexity for nothing. We’ve actually already been there before with this same project and intentionally simplified it by a ton. It would’ve been more expensive in the long run otherwise. (I admit we also aren’t running anything like microservices for the very same reason. It’s more like 3 big services, which used to be 5 smaller ones a couple years ago.)
Any complicated tool in your infra will have some associated maintenance costs plus a learning curve for your team. Which means it must save more in dev time or stability. For small teams that equation often comes out negative.
So yeah - talk to your team, identify the worst pain points, find the simplest possible fixes, get the team to be on board with you, implement, repeat until shit ain’t on fire anymore.
PressureHumble3604@reddit (OP)
Fair but we have two big differences:
we have a ton of microservices and we have a ton of outages, so it’s objectively not working for us.
Esseratecades@reddit
I've been there more than once.
Figure out what people's regular technical frustrations are. Are releases difficult? Is the product bug prone? Are there people who are regularly arguing about the same things over and over again?
Next, find one of these that has a solution that requires only small changes. Finally, and this is important, explain to your team how this solution solves problems that they have.
Implement the solution with their buy-in. Rinse and repeat until they're modern.
PressureHumble3604@reddit (OP)
that’s my first approach but common complaints are generic or require major refactoring. when I propose or want to talk about smaller things that can still improve our day to day, almost no one seems to care.
I have had a few small improvements ignored for months. Another one landed and I was speechless that they could go for so many years without it.
It feels like we speak different languages. I’m an alien.
CodelinesNL@reddit
Yes.
thomas_grimjaw@reddit
I've been there twice already.
I was even hired the second time because I spearheaded a big refactor in the first company.
Then I got to do none of what I was basically hired to do.
My point is, it's up to management and the team to make that happen, not you personally against the tide.
Bonus tip:
If it's been like that for some time and through high turnover, then probably what you call "primitive infra" some teammates call "job security" and have a counter-incentive to make it better.
Far-Policy5814@reddit
Better get used to it. Things are the way they are for a reason
AppealSame4367@reddit
Housekeeping, structuring your work, saving big IT projects that are a mess: always the same procedure. Start somewhere small with better structure and just keep working it in. Maybe next week you spend 1% less time firefighting, and so on.