No you have caused this..

Posted by djmykey@reddit | talesfromtechsupport

I work at a rather large MNC and we have an MSP helping with the daily mundane tasks and taking care of incidents that come along.

Recently we had an instance where 90+ incidents got generated because the monitoring solution detected a drift of > 60s from the actual time.

Actors in this story:

$Me: Me

$Win: Windows Team member.

$Mon: Monitoring Team Member.

$NW: A Network Expert (Not MSP).

$DNS: Networking and DNS Team Member.

$AD: AD Team Member.

I was in another call before this whole issue started and could not drop off because that was another storm I did not want on my horizon. By the time I joined the call to discuss the time issue people on the call were already trying to figure things out.

$NW: I do not see any latency in the network going towards the AD servers or the NTP servers.

$Win: It's not a Windows issue. We need to figure out why the time shifted so much. $AD tell us why your time source shifted.

$AD: Our time source did not shift. In fact we depend on InfoBlox (DNS) as our time source. InfoBlox is the agreed Time source for this infra.

**murmur about getting $DNS to join the call**

$DNS: Yes tell me.. you guys have a problem with time shifting?

$AD: Yes, and due to that we have lost one jump host / bastion host.

$DNS: Yeah, I'm looking at InfoBlox, we do not see that the time has shifted by any amount recently. You might have to take a look at your systems.

$AD and $Win: But we have got incidents, so many of them, you need to tell us why this happened..

At this point I am thinking what the hell is going on !!

In the meantime I take a look at the ticket queue and log in to some of the servers that had a ticket raised for a time shift. What I notice is that the time has not shifted, and the servers are not even all pointed to the same time source. I was pretty confused at this point.

$Me: Can someone get $Mon on this call?

$Win: On it.

$Mon: Hello team, tell me how can I assist?

$Me: So we have 90+ incidents which were raised and got cleared in our systems for time shifts. Can you tell us more about why this happened? Please remember, no one is blaming you or the systems you manage. We just need to figure out what happened and avoid it in the future, that's all.

$Mon: Oh ok, so.. when the agent... **goes on to explain what happens in a normal situation**

$Me: Let me stop you right there. Sorry for doing this, but in the interest of time, we know how your monitoring agent works. What we need to figure out here is what exactly happened.

$Win: Your system had a time shift which is why so many incidents got raised.. take those tickets to your queue and resolve them.

$Mon: Do you have any evidence of this fact?

$Me: $Win, let's review everything we can see first, and only then start asking anyone to take a look at their systems.

We go through all the evidence I lay out for them: quite a lot of servers got an incident generated for them, and most of those servers were reporting to one specific monitoring server. All of them went off at the same time. All of them seem to have come back on track at the same time. I would have loved it if this had really happened, but alas, none of our systems are synchronized swimmers.
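For anyone who wants to run the same kind of check, here is a rough sketch of how that evidence could be tallied from an incident export. The record fields (`server`, `monitor`, `raised`, `cleared`) and the sample rows are made up for illustration; they are not from our ticketing tool. The idea is simply that if one monitoring server and one raise/clear window account for nearly all incidents, the monitoring side is the first place to look.

```python
from collections import Counter


def summarize_incidents(incidents):
    """Return (top monitor, its share, top (raised, cleared) window, its share)."""
    if not incidents:
        raise ValueError("no incidents to summarize")
    total = len(incidents)
    # How many incidents each monitoring server reported.
    monitors = Counter(rec["monitor"] for rec in incidents)
    # How many incidents share the exact same raise/clear timestamps.
    windows = Counter((rec["raised"], rec["cleared"]) for rec in incidents)
    top_monitor, monitor_count = monitors.most_common(1)[0]
    top_window, window_count = windows.most_common(1)[0]
    return top_monitor, monitor_count / total, top_window, window_count / total


# Hypothetical sample data, only to show the shape of the input.
incidents = [
    {"server": "app01", "monitor": "mon-a", "raised": "09:00", "cleared": "09:05"},
    {"server": "app02", "monitor": "mon-a", "raised": "09:00", "cleared": "09:05"},
    {"server": "db01", "monitor": "mon-a", "raised": "09:00", "cleared": "09:05"},
    {"server": "web01", "monitor": "mon-b", "raised": "09:00", "cleared": "09:05"},
]

print(summarize_incidents(incidents))
```

A high share for a single monitor or a single window does not prove anything by itself, but it is the kind of evidence that makes the conversation much shorter.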

$Me: You see these things? Also, none of the servers has an event which tells us that the system clock was corrected. I suspect that either your monitoring server's time was off, or something else was off which made the monitoring solution think that the servers' clocks were off by >60s.

$Me: $Mon, please investigate this and do let us know if you find anything.

We disband the call and create a group chat to have further updates on this topic.

2 hrs later:

$Mon: Yeah we had restarted a service on the monitoring server because it had crashed. That has caused this whole fiasco.

$Me: You know what to do !!

Moral of the story: If you do not present evidence or sound reasoning to back up your suspicions, people will push back on any query you put forward.

[-]

Techn0ght@reddit

The response is always: "It's not [[my area]] and we didn't change anything".

4 hours later: "Yes, we had a Change, but it doesn't list it would do this".

Me: "You expect your Change to say it would make the servers unresponsive?".

[-]

djmykey@reddit (OP)

For some reason, or let's say from hearing this so many times, whenever anyone says at the drop of a hat, "our systems are fine"... I get very angry. It sends me up the wall. I mean, we obviously have an issue. Why can't you take a look now? Or do you keep staring at your systems all day long?

[-]

jthomas9999@reddit

I am the networking infrastructure person. If I had a dime for all the times the network was blamed because someone didn’t want to take the time to do some basic troubleshooting, I would be wealthy. I end up spending too much time proving others wrong.

[-]

Icy_Conference9095@reddit

I'm L2 non-networking support and learned very quickly with networking support how to prove it wasn't a network issue at my site. Networking is understaffed at my location, so something that takes 5-10 minutes to check on their end may take 2-3 days before it gets done.

Long story short, I don't mess with those guys unless I'm 100% confident it's actually a network issue.

[-]

Responsible-End7361@reddit

If you say it is me and I investigate and it isn't me, I wasted my time. If I refuse to look until you prove it is me (by testing everything on your end, wasting your time) then I only troubleshoot once I know it is me, saving my time.

(Their logic)

[-]

Techn0ght@reddit

Same. I swear if their monitoring is green they think everything is fine and won't look any deeper.

[-]

djmykey@reddit (OP)

I don't know when people will understand that the health of your device depends on how deep you are monitoring. Obviously you cannot monitor every breath the device takes, so there is a lot that remains to be monitored. But.. "everything is green on our side. It's your problem."

[-]

Techn0ght@reddit

What's funny is no one accepts this from the network folks, but they readily accept it from the systems and database folks.

[-]

Speciesunkn0wn@reddit

Probably because the network connects everything, so clearly the network is changing things to be wrong! (Even though the network is supposed to transmit everything exactly as it receives it...)

[-]

Techn0ght@reddit

There are some things that are intended to change traffic actually, like a content switching load balancer.

[-]

Speciesunkn0wn@reddit

Huh.

[-]

blue_canyon21@reddit

Why did this feel like a manhunt instead of a group of people trying to figure out what caused a self-resolved issue?

Seems like the whole call/meeting was a waste of time that just caused undue stress to multiple people.

Could have been an email saying, "Hey all, we had a number of incidents pop up saying [List of servers] had a timeshift of >60 seconds. The issue has since self-resolved, but can each of you investigate your respective services and attempt to establish a root cause? I'll check back with all of you in X hours."

[-]

djmykey@reddit (OP)

This wasn't the last of it. They upgraded the monitoring solution, and it opened up a lot of different monitoring options which weren't available in the older version. Shit hit the fan, literally. We got so so so many incidents it was a jungle out there. We had to raise a change to get them to disable a whole lot of monitoring options to restore sanity.

[-]

marshmallowcthulhu@reddit

Can you just stop monitoring so that monitoring doesn't cause incidents?

[-]

jandienal@reddit

You've got what it takes to succeed in middle management!

[-]

L0pkmnj@reddit

Ummm, I'm gonna need you to go ahead and come in tomorrow. So if you could be here around 9 that would be great, mmmk...?

Oh oh! and I almost forgot ahh, I'm also gonna need you to go ahead and come in on Sunday too, kay. We ahh lost some people this week and ah, we sorta need to play catch up.

[-]

djmykey@reddit (OP)

10 points to Gryffindor !!!

[-]

Turbojelly@reddit

2 years ago we were fighting with a server that kept changing time. Sysadmin was adamant there was nothing wrong on his end. I eventually did a full GPO export and found 3 different GPOs pointing to 3 different time sources.

[-]

hockeyak@reddit

Segal's Law - "A man with a watch knows what time it is. A man with two watches is never sure."

[-]

djmykey@reddit (OP)

whattay clusterfsck that would be.. damn.

[-]

Turbojelly@reddit

Currently on the second week of trying to get the same sysadmin to inject the network driver of a new PC model into the build image. He keeps insisting the issue is at our end despite the mountain of evidence we are blasting him with.

[-]

Jonathan_the_Nerd@reddit

Time to pull out the clue-by-four?

[-]

djmykey@reddit (OP)

If you installed it on one PC and it solved the issue, you have more than sufficient evidence. If the sysadmin is objecting, he needs to be ejected from that role (not job).

[-]

newfor2023@reddit

I really thought you were going to say window.

[-]

pavelow007@reddit

Ahh, a BOFH connoisseur.

[-]

newfor2023@reddit

Defenestration over IP

[-]

Turbojelly@reddit

Literally getting no IP address and an error saying "no network driver found" when imaging off a USB, and PXE was hanging to a restart.

All fixed now.

[-]

mercurygreen@reddit

Yeah, I've seen them pulling time from A-B-C-A in a ring.

I've also seen a junior deciding that the Virtual machines should all pull time from the physical box they were on... without bothering to check what the time was on that physical box (HINT the senior guy had decided it didn't matter on the physical box. AND NOW THERE'S A PAIR OF NEW POLICIES IN PLACE...)
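A ring like that A-B-C-A one is easy to miss by eye but easy to catch with a few lines of code. This is only a sketch under a simplifying assumption: each host has exactly one configured upstream time source, held in a plain dict. The host names and the `find_time_source_ring` helper are invented for the example, not taken from any real tool.

```python
def find_time_source_ring(upstream, start):
    """Follow upstream pointers from start; return the ring if one is hit, else None."""
    path = []   # hosts visited so far, in order
    seen = {}   # host -> index in path
    host = start
    while host in upstream:
        if host in seen:
            # We are back at a host we already visited: slice out the loop.
            return path[seen[host]:] + [host]
        seen[host] = len(path)
        path.append(host)
        host = upstream[host]
    # Reached a host with no configured upstream (e.g. a real reference clock).
    return None


# Hypothetical configuration: A syncs from B, B from C, C from A.
upstream = {"A": "B", "B": "C", "C": "A"}
print(find_time_source_ring(upstream, "A"))
```

With a healthy chain that ends at a real reference clock, the function returns `None` instead of a list of hosts.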

[-]

Stryker_One@reddit

You know what to do !!

The needful?

[-]

Impossible_IT@reddit

Kindly

[-]

ManufacturerWitty700@reddit

The hokie pokie?

[-]

djmykey@reddit (OP)

as always 😏😏

[-]

JerseySommer@reddit

It helps when you sing it to "do the hustle"

DO THE NEEDFUL! 🎵🎶Doo doo doo doot doo doo doo doot doo doo, doot doot doot doo doo!🎶

[-]

Xaphios@reddit

Just make sure you also Revert.

[-]

S34d0g@reddit

Back to the original?

[-]

kreeghor@reddit

Kindly do the needful

[-]

ghostlee13@reddit

Wow, Cricket Liu who wrote the book on DNS and BIND!

[-]

Comfortable-Scale132@reddit

I've actually seen this occur in Nagios and everything was in the Log regarding timeshift. Not sure what caused it. It only happened once.

[-]

ServoIIV@reddit

We had some systems we were supporting that were extremely time sensitive. Systems in multiple locations in multiple countries had to be within a certain number of microseconds of each other. Because of this we used GPS transceivers as a time source. The internal clocks in the equipment were very precise and would stay within the acceptable time margin for up to 6 months even if the GPS was not functioning. We got a call that we had a broken part in the system, which was not unusual in the environment we were operating in. We sent down a spare, but the on-site tech said that one was also broken. The part had been tested before being sent, but we sent another one. The tech reported that one was broken as well. I ended up getting flown in to a very remote location to take a look at the system. It turned out there were two different signal splitters to feed time to different parts of the equipment depending on which modules were being used. At some point in the last 6 months a different tech had replaced some parts and hooked the GPS timing device up to the wrong splitter, and it wasn't connected to the modem at all, but it had eventually drifted enough to stop working. All the spares were fine but couldn't get synchronized between the two devices on each end of the connection, so they threw errors and wouldn't initialize. At least I got an interesting trip out of it.