Dealing with a Team with primitive Infra that seems fine with it. Cultural Mismatch?

Posted by PressureHumble3604@reddit | ExperiencedDevs | 49 comments

Some months ago, for various reasons, I joined this team. It is quite prestigious within the big company and has well-above-average engineers.

They are tackling a complex domain and they have been doing so for years with a microservice architecture.

That's fine.

That is, until I discovered that the infra is very primitive and the microservice architecture is bloated and inefficient (with horrible horizontal scaling).

Some examples:
- Almost no orchestration (no Kubernetes or similar)
- Extremely simple, hardcoded load balancing
- No tracing, no proper debugging other than console.log and praying to the machine gods
- Barely usable testing environments (no debugging there)
- No service discovery; if there is any, I have yet to "discover" it
- Very limited metrics, it's hard to set up new ones and they are not precise
- No tool to manage logs

Now, the reason the system is in this state is that some people fucked up many years ago, and management doesn't seem to care that the system has frequent outages and that the engineers spend 90% of their time firefighting.

At the same time, there are so many small things we could do at the low level to improve day-to-day life, things that should have been done years ago.

The problem is that personal initiative is frowned upon, at least partially. When it isn't, there is no guidance, so it's just people working on their own without coordinating.

At the lower level we discuss the issues frequently, in an informal and inefficient way (yes, the department's communication is crap at every level), but not everyone seems to see the situation as being as dramatic as it is.

The daily life of an employee consists mostly of ssh-ing into multiple production machines, grepping several logs, and going down a rabbit hole to investigate the daily outage. If we are lucky, we can run some horrible shell scripts that may help us investigate or hotfix the issue.

Is this normal?

I ask because the people who work with me are quite smart, definitely the best ones I have worked with, but they don't seem to have their priorities straight, nor can they communicate properly.


morosis1982@reddit

If the team has been around for a while, it's possible they started with a monolith, migrated to microservices, and just didn't know what they didn't know. They never got around to updating the other aspects, and now they don't care much.

I've mostly worked in R&D, building new systems, where you need to ship features, so I have a strong history of minimising the time I spend on day-to-day support so that I can build.

If they don't have a significant backlog of features, then the day to day of just investigating and fixing issues could be 'enough' for them.

Honestly, the thing we do in my team is just ask ourselves each sprint: what is causing us pain? Then we create a story to tackle it and just get it done. If it's a big change, we break it down so that it doesn't significantly affect capacity.

Just start somewhere. If parsing logs is the issue, see if you can introduce a log ingest, maybe there's capacity to get something open source installed and ingesting the logs so that you can query in one place. Especially in distributed services, add a tracking token to events so that an event can easily be traced across the system.
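To make the tracking token idea concrete, here's a rough sketch of how it could look. It's in Python purely for illustration (the post only mentions console.log, so the real stack is probably different), and every name in it (`tracking_token`, `build_logger`, `handle_event`, the `orders` and `billing` services) is made up for the example. The idea: the first service to see an event mints a token, every later service reuses it, and each log line is emitted as JSON with that token so whatever log ingest you pick can filter on it.

```python
import contextvars
import io
import json
import logging
import uuid

# Hypothetical name: holds the tracking token of the event currently being handled.
tracking_token = contextvars.ContextVar("tracking_token", default="-")


class TrackingTokenFilter(logging.Filter):
    """Copies the current tracking token onto every log record."""

    def filter(self, record):
        record.tracking_token = tracking_token.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emits one JSON object per line, which most log ingest tools can parse."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "tracking_token": getattr(record, "tracking_token", "-"),
            "message": record.getMessage(),
        })


def build_logger(service_name, stream):
    """Builds a logger for one (hypothetical) service writing to `stream`."""
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TrackingTokenFilter())
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def handle_event(logger, event):
    """Reuses the upstream token if present, otherwise mints one at the edge."""
    token = event.get("tracking_token") or uuid.uuid4().hex
    reset = tracking_token.set(token)
    try:
        logger.info("handling %s", event["type"])
        # The outgoing event carries the token to the next service.
        return {**event, "tracking_token": token}
    finally:
        tracking_token.reset(reset)


def demo():
    """Two services log to one buffer; returns the parsed log lines."""
    buf = io.StringIO()
    orders = build_logger("orders", buf)
    billing = build_logger("billing", buf)
    outgoing = handle_event(orders, {"type": "order_created"})
    handle_event(billing, outgoing)
    return [json.loads(line) for line in buf.getvalue().splitlines()]
```

Once the logs from all machines land in one place, investigating an outage becomes "filter on one tracking_token" instead of ssh-ing around and grepping each box. In a real system the token would travel in a message header or HTTP header rather than in the event body, but that depends on how your services talk to each other.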

If you want to get the team on board, start by asking what they like doing the least and plan to solve that.
