Is there anything tangible you can do as an engineering team to improve another team’s poor upstream services?
Posted by kutjelul@reddit | ExperiencedDevs | 30 comments
My project is basically a user-facing client that consumes about 20 upstream services. While integrating them, we typically find the quality of those services to be very low. We've seen random 20s response times, invalid data being returned as valid, etc.
Occasionally we're working on time-sensitive integrations, and even though the upstream service is claimed to be ready in time, that is rarely the case.
We cannot (as engineers) easily reach out to the teams involved, as our organization is spread out through too many layers, divisions, and locations. EMs sometimes have a point of contact, but in general those points of contact are also very slow to respond or might not understand the problems at all.
I might be biased, but it often feels like my team sort of has their shit together, and almost all of those other teams are just messing up time after time. Of course, I might not have the full picture of why these teams perform like this.
Anyway, what can I do about this, together with my team?
drew_eckhardt2@reddit
Request hedging and retries.
With request hedging you send an additional request after a short delay then take the first result which comes back.
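A minimal sketch of the idea in Python, standard library only; the hedge delay, URL, and `fetch` helper here are illustrative assumptions, not something from this thread:

```python
import concurrent.futures
import urllib.request

def fetch(url: str, timeout: float = 30.0) -> bytes:
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

def hedged_get(url: str, hedge_delay: float = 0.2) -> bytes:
    """Send one request; if it hasn't finished within hedge_delay seconds,
    send a second identical request and return whichever completes first."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    try:
        futures = [pool.submit(fetch, url)]
        done, _ = concurrent.futures.wait(futures, timeout=hedge_delay)
        if not done:
            futures.append(pool.submit(fetch, url))  # the hedge request
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED
        )
        return next(iter(done)).result()
    finally:
        # Don't block on the slower duplicate; it's abandoned, not cancelled.
        pool.shutdown(wait=False)
```

Only sensible for idempotent reads, and it does add extra load to an already struggling service, so use it sparingly.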
dreamingwell@reddit
Measure and show.
kutjelul@reddit (OP)
To whom should I show this? Because our management and architect are aware; there's just no action on it (because 'good enough'). I've come close to ranting at them about (our) time not being free, but I figured that's too emotional and not effective.
geeeffwhy@reddit
not even really devils advocate here: how do you know it’s not “good enough”?
we all have aesthetic preferences as craftspeople, and i have no doubt this situation offends you in that capacity, but … the business doesn’t care about that. the business cares about their goals (usually profit, ultimately, but much more complex in practice) and if fixing the upstream services won’t move the needle on those goals then this is very much a you problem.
if, on the other hand, you’re able to demonstrate a correlation between poor service performance and negative impact on known goals, this conversation is easier to have.
dreamingwell@reddit
You have to show them measurements that they care about. Revenue, customer churn, and operating costs are a good place to start. If you can’t connect those dots, then wasted employee time might be worth trying - but also they may not care, as the work is getting done and they may be happy with the current payroll amount.
As a last resort, tell them “we’re going to start losing employees, and it will be hard to find replacements willing to accept this kind of work”.
Nosferatatron@reddit
Agree. No commentary necessary, just provide metrics. Could even set up a dashboard with performance stats
dolcemortem@reddit
Tie the poor performance to customer facing metrics and you’ll get business attention quickly.
morosis1982@reddit
Metrics is about it, shown to the business people higher up and how it affects customer interaction.
That, or doing their job better than they do and ninja-replacing it with your own. That one is not popular though; I've only done it once. Our version was so much better that it worked out, but politically it was a bit of a hot topic until our cost effectiveness and integration capability far outstripped the original.
If it is at all possible you could provide a caching layer. There is another service that we consume that is quietly being cached in another part of the business and changed from the single source of truth into just one of many sources, primarily because the service was slow and unreliable. We have a feed that is supposed to have sub 10s delay, but our metrics put it at an average of 15s with high variability up to hours.
If you call their API it reliably takes several seconds, and up to 30s or worse. Which, in an org that uses AWS API Gateway and Lambda in a significant capacity, is a deal breaker, as there is a hard 30s timeout.
Grabowskyi@reddit
What we did in a similar situation is cache everything we could, PROACTIVELY. It only works for reads, and you don't always know the full set of FKs you're going to be using. But when you do, proactive caching helps a lot. We have a background job that constantly refreshes the cache.
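For what it's worth, a minimal sketch of that pattern in Python; the key list, refresh interval, and upstream URL are placeholder assumptions:

```python
import threading
import time
import urllib.request

CACHE = {}                          # key -> last known-good payload
KNOWN_KEYS = ["user:1", "user:2"]   # the FKs you know about in advance
REFRESH_INTERVAL = 60.0             # seconds between refresh passes

def fetch(key):
    url = f"https://upstream.example.com/v1/{key}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()

def refresh_loop():
    while True:
        for key in KNOWN_KEYS:
            try:
                CACHE[key] = fetch(key)   # overwrite only on success
            except Exception:
                pass                      # keep the old value; never clear on failure
        time.sleep(REFRESH_INTERVAL)

# Reads hit CACHE directly and never wait on the slow upstream.
threading.Thread(target=refresh_loop, daemon=True).start()
```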
roodammy44@reddit
At Amazon the philosophy was that teams should speak to one another through HTTP. Proper service-oriented architecture. Part of the reason this succeeded was proper service contracts; more of the reason it worked was that management would come down hard on services that did not perform.
Also, the idea was that each service would be competing against other services. The bad ones would die.
I would say the reason your services are failing is because you have lazy, complacent management (if you have already told management about the bad services, that is).
kutjelul@reddit (OP)
Yeah, that makes sense. Actually I’m painfully aware of management’s shortcomings. But yeah, that’s why I wonder if it’s even possible to improve this from the engineering side
ZukowskiHardware@reddit
Just make sure your kitchen is clean and everything is clearly documented. I had an upstream team complain that our api wasn’t working when they were trying to store a uuid we were returning to them in a db field that was too small. Our api was fully documented. Took them like a week to figure it out and fix it even though I showed them their own error message. Just put up clear walls and boundaries.
pl487@reddit
Don't accept that you can't talk to them. Figure out who they are and open communications. Do it when there isn't anything on fire. Drill through those layers to the person who can actually do something.
Silver_Bid_1174@reddit
First off, make sure you're practicing defensive coding - exponential backoff on retries, circuit breakers, etc. If you're getting the same data repeatedly, cache it, but don't clear your cache until you know you have good data (assuming that old data is better than nothing).
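A rough sketch of the backoff and circuit-breaker pieces in Python, standard library only; the thresholds and timeouts are illustrative assumptions, not recommendations from this thread:

```python
import random
import time
import urllib.request

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    """Crude circuit breaker: after N consecutive failures, fail fast for a while."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.failures >= self.failure_threshold:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise CircuitOpenError("upstream circuit is open, failing fast")
            self.failures = 0  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

def get_with_backoff(url: str, attempts: int = 4) -> bytes:
    """Retry with exponential backoff and jitter: roughly 1s, 2s, 4s between tries."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.read()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep((2 ** attempt) + random.random())
```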
Then make sure you're logging everything and alerting on all missed SLAs and other issues. You want the failures to be as noisy as possible.
Escalate it as high as possible in the food chain with solid data about the issues and the business impact of those issues.
And double / triple check that it's the other team's fault and not your own.
Good luck
veryspicypickle@reddit
Don’t solve political problems with technical solutions
call_Back_Function@reddit
Make a test suite using some REST API tester or even a bunch of curls. Send it to the team, CC some bosses, and tell them you found some issues you would like addressed. State the impact and cost of the issue. If there's no response, level up the org chart on the CC and ping for status.
You will eventually hit someone that wants to know why this has reached their inbox and has not been acted on.
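If it helps, here's a rough Python stand-in for the "bunch of curls" approach that prints something you can paste straight into that email; the endpoint list is a placeholder:

```python
import time
import urllib.error
import urllib.request

ENDPOINTS = [
    "https://upstream-a.example.com/health",
    "https://upstream-b.example.com/v1/users/123",
]

def check(url, timeout=10.0):
    """Return (HTTP status or None on connection failure, elapsed seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, time.monotonic() - start
    except urllib.error.HTTPError as e:
        return e.code, time.monotonic() - start
    except Exception:
        return None, time.monotonic() - start

if __name__ == "__main__":
    for url in ENDPOINTS:
        status, elapsed = check(url)
        print(f"{url}: status={status} latency={elapsed:.2f}s")
```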
_hephaestus@reddit
Start from business impact then work backwards. When the upstream service is not actually ready, how is the end user impacted? How is your team’s success impacted?
If you can clearly identify this hurting your team’s success, can you set requirements on upstream dependencies that would improve business impact?
BoBoBearDev@reddit
Each team should be responsible for a vertical slice of the overall front end. If your team manages the user management front end, you manage the backend as well. If another team is doing the backend for chat functionality, they manage that front end as well. Basically full-stack developers.
If you are the team that didn't work on any of those, you just bundle them into one app and pass the bug ticket to the team that is responsible. You shouldn't care about their quality, because you are not the one managing the actual components.
PredictableChaos@reddit
What do you use for Observability? Make sure you're collecting all that data. Ideally you have metrics that show the response time and error rates for each of those endpoints. Even if you just have traces you can usually generate metrics from that data.
Then make a dashboard that highlights all of this. Find a way to make it very public if necessary.
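Even with no tracing infrastructure in place, client-side counters get you a first dashboard. A minimal sketch, with the names and percentile choice as illustrative assumptions:

```python
import statistics
import time
from collections import defaultdict

latencies = defaultdict(list)   # endpoint -> list of call durations in seconds
errors = defaultdict(int)       # endpoint -> error count

def timed_call(endpoint, func, *args, **kwargs):
    """Wrap any upstream call so its latency and failures get recorded."""
    start = time.monotonic()
    try:
        return func(*args, **kwargs)
    except Exception:
        errors[endpoint] += 1
        raise
    finally:
        latencies[endpoint].append(time.monotonic() - start)

def report():
    for endpoint, samples in latencies.items():
        p95 = statistics.quantiles(samples, n=20)[-1] if len(samples) > 1 else samples[0]
        print(f"{endpoint}: calls={len(samples)} errors={errors[endpoint]} p95={p95:.2f}s")
```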
Life-Principle-3771@reddit
What is your contract/SLA with them? If you don't have one, demand one, then page them when they violate it.
flavius-as@reddit
Get more direct, read-only access to the data and roll your own specific logic if you need to.
gdvs@reddit
Collect data. Report the problems objectively and let someone else (mgmt) take responsibility. The complex structure is not your responsibility.
poipoipoi_2016@reddit
So ideally, you have open-source codebases + it's not a crazy crazy architectural redo, and you go fix their stuff and send over PRs and open the conversation with the reviewers. Second best case is you look up who is oncall and start back-channeling through them. Worst case, all comms go through management.
> We cannot (as engineers) easily reach out to the teams involved
> our organization is spread out through too many layers, divisions, and locations. EMs sometimes have a point of contact
I mean, it might go nowhere and black hole.
But if you're THAT big, a corporate directory if only for oncall escalations is a must must. I would consider having to escalate via that management chain to be a yellow flag at best, but at some point what is a manager for? Get them to start escalating until you find someone who does actually know what they're talking about.
Sell it to management as "Critical OKR X is now blocked on Subteam Y, who need to clarify Z," and get the skip-level involved if you have to. You will have to work to make the sell (particularly if management is as non-technical as it sounds) and manage up to get them to spend time on it.
Trust me, I'm in my first quasi-management role, and "I lack technical context to accomplish business goals on which I am being judged, and that one guy is over in the corner doing nothing, but it'd take even longer to pull context out of him" is the worst feeling in the world.
tr14l@reddit
Set monitoring alerts that CC the CTO.
FutureSchool6510@reddit
Implement strict timeouts. Drop bad data on the floor. Let the failures show through to the consumers.
Add metrics around upstream response times and erroneous data. Make sure you also have metrics for successful calls and good data to compare against. Show the ratio of good to bad.
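A minimal sketch of that combination; the required fields and the 3s timeout are assumptions for illustration:

```python
import json
import urllib.request
from collections import Counter

counts = Counter()  # "good" vs "bad" responses, for the ratio metric

def fetch_record(url, required_fields=("id", "status")):
    """Fetch one record with a strict timeout; return None and count it as
    bad if the call is slow, fails, or the payload is structurally invalid."""
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:  # strict timeout
            payload = json.loads(resp.read())
    except Exception:
        counts["bad"] += 1
        return None
    if not isinstance(payload, dict) or any(f not in payload for f in required_fields):
        counts["bad"] += 1  # drop bad data on the floor instead of passing it on
        return None
    counts["good"] += 1
    return payload
```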
olddev-jobhunt@reddit
This is primarily a political problem, not a technical one. But... fixing it probably means escalating, and doing that effectively will require data. You can't just say "your service sucks." What you can say is something concrete like "as soon as we hit 25 TPS, your P95 latency jumps to >20s, as you can see on this date."
With data, you can prove something to the rest of the org and that can get you the buy in. Ideally, you get an SLA in place from them. It could potentially be useful for you to start trying to meet an SLA for your own downstream teams. You'll miss it, of course, but if you're armed with data, you and your consumers can all talk to the upstream service together.
Just measure, and escalate. If the CEO isn't interested in fixing it, it must not be very important after all. But for all other cases, there'll be someone higher up you can push to.
08148694@reddit
This seems like an organisational problem that you’re not going to fix. If this is a problem it’s probably not an environment where change comes from below
You can try to reach out to management with logs and charts and evidence of the upstream teams incompetence, but it’ll probably fall on deaf ears
If there’s communication barriers between teams then they should be treated as mini b2b services with strict SLAs
I’d either push to pull down the barriers and actually talk to each other, or put these SLAs in place so the other teams have accountability for their services
hitanthrope@reddit
Honestly, the big hairy monster is the fairly destructive culture of your organisation. The proper answer to this question is, "you get somebody on a quick call, explain the problem and why it has impact and you figure out what needs to be done". Doesn't sound like you work in that kind of place. Many of us don't.
You do have the flipperoo option, which is less than optimal but might work in your case. You treat them as a supplier, with you being the customer. Take the separation seriously since it has been placed between you, and treat them as an outside organisation. Start coming up with SLAs (reasonable ones) and writing the contract tests to assure them. Ask for a support channel, somewhere to file bug reports, response time on high priority issues... the whole 9 yards.
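A rough sketch of what those contract tests might look like (pytest-style); the URL, latency target, and required fields are placeholder assumptions to be replaced by whatever SLA you actually agree on:

```python
import json
import time
import urllib.request

UPSTREAM = "https://upstream.example.com/v1/orders/42"

def test_responds_within_sla():
    start = time.monotonic()
    with urllib.request.urlopen(UPSTREAM, timeout=5) as resp:
        assert resp.status == 200
    assert time.monotonic() - start < 2.0  # agreed latency target, illustrative

def test_returns_required_fields():
    with urllib.request.urlopen(UPSTREAM, timeout=5) as resp:
        payload = json.loads(resp.read())
    for field in ("id", "status", "updated_at"):
        assert field in payload, f"contract violation: missing {field}"
```

Run on a schedule, these give you a timestamped record of every violation to attach to the bug reports you file through that support channel.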
*If* you have the kind of upper-mid / upper management who would say "don't be ridiculous" to something like this, then it's probably just a defective situation. I've worked at places where departments inside the company invoice each other... you can structure things like this.
Does sound like something you'll need help from management on though. Depending on how all this stuff is structured.
kernel_task@reddit
20 seems like a nightmare. Do you really need all 20 services? Is there any way to replace some of them with stuff your team can build yourself? Any way to get more direct access to the data sources?