No you have caused this..

Posted by djmykey@reddit | talesfromtechsupport

I work at a rather large MNC and we have an MSP helping with the daily mundane tasks and taking care of incidents that come along.

Recently we had an instance where 90+ incidents were generated because the monitoring solution detected a drift of more than 60s from the actual time.

Actors in this story:

$Me: Me

$Win: Windows Team member.

$Mon: Monitoring Team Member.

$NW: A Network Expert (Not MSP).

$DNS: Networking and DNS Team Member.

$AD: AD Team Member.

I was on another call before this whole issue started and could not drop off, because that was another storm I did not want on my horizon. By the time I joined the call to discuss the time issue, people on the call were already trying to figure things out.

$NW: I do not see any latency in the network going towards the AD servers or the NTP servers.

$Win: It's not a Windows issue. We need to figure out why the time shifted so much. $AD, tell us why your time source shifted.

$AD: Our time source did not shift. In fact we depend on InfoBlox (DNS) as our time source. InfoBlox is the agreed Time source for this infra.

**murmur about getting $DNS to join the call**

$DNS: Yes tell me.. you guys have a problem with time shifting?

$AD: Yes, and due to that we have lost one jump host / bastion host.

$DNS: Yeah, I'm looking at InfoBlox, we do not see that the time has shifted by any amount recently. You might have to take a look at your systems.

$AD and $Win: But we have got incidents, so many of them, you need to tell us why this happened..

At this point I am thinking: what the hell is going on!!

In the meantime I take a look at the ticket queue and log in to some servers that had a ticket raised for them about time shift. What I notice is that the time has not shifted, nor are all of these servers pointed to the same time source. I was pretty confused at this point.

$Me: Can someone get $Mon on this call?

$Win: On it.

$Mon: Hello team, tell me how can I assist?

$Me: So we have 90+ incidents which were raised and then cleared in our systems for time shifts. Can you tell us more about why this happened? Please remember, no one is blaming you or the systems you manage. We just need to figure out what happened and avoid it in the future, that's all.

$Mon: Oh ok, so.. when the agent... **goes on to explain what happens in a normal situation**

$Me: Let me stop you right there. Sorry for doing this, but in the interest of time, we know how your monitoring agent works. What we need to figure out here is what exactly happened.

$Win: Your system had a time shift which is why so many incidents got raised.. take those tickets to your queue and resolve them.

$Mon: Do you have any evidence of this fact?

$Me: $Win, let's review everything we can see first, and only then start asking anyone to take a look at their systems.

We go through all the evidence I lay out for them: quite a lot of servers got an incident generated for them, and most of those servers were reporting to one specific monitoring server. All of them went off at the same time. All of them seem to have come back on track at the same time. I would have loved it if this had really happened, but alas, none of our systems are synchronized swimmers.
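For anyone who wants to run the same kind of check, here is a minimal sketch of that correlation. Everything in it is an assumption for illustration: the record fields (`host`, `monitor`, `raised`, `cleared`), the sample data, the 80% threshold, and the function name `common_cause_candidates` are all made up, not taken from the ticketing or monitoring tools in this story.

```python
from collections import Counter

# Hypothetical incident records; hosts, monitor names, and timestamps are invented.
incidents = [
    {"host": "srv-01", "monitor": "mon-a", "raised": "10:00", "cleared": "10:05"},
    {"host": "srv-02", "monitor": "mon-a", "raised": "10:00", "cleared": "10:05"},
    {"host": "srv-03", "monitor": "mon-a", "raised": "10:00", "cleared": "10:05"},
    {"host": "srv-04", "monitor": "mon-a", "raised": "10:00", "cleared": "10:05"},
    {"host": "srv-05", "monitor": "mon-a", "raised": "10:00", "cleared": "10:05"},
    {"host": "srv-06", "monitor": "mon-b", "raised": "09:12", "cleared": "09:40"},
]


def common_cause_candidates(records, threshold=0.8):
    """Return (group, count) pairs where one monitoring server accounts for
    at least `threshold` of all incidents, raised and cleared at the same time."""
    groups = Counter(
        (r["monitor"], r["raised"], r["cleared"]) for r in records
    )
    total = len(records)
    return [
        (group, count)
        for group, count in groups.most_common()
        if count / total >= threshold
    ]


for (monitor, raised, cleared), count in common_cause_candidates(incidents):
    print(f"{count} incidents share {monitor}, raised {raised}, cleared {cleared}")
```

If one group dominates like this, the common factor (here the monitoring server) is a better first suspect than dozens of independent clocks drifting and recovering in lockstep.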

$Me: You see these things? Also, none of the servers have an event which tells us that the system clock was corrected. I suspect that either your monitoring server's time was off, or something else was off which made the monitoring solution think that the servers' clocks were off by >60s.

$Me: $Mon, please investigate this and do let us know if you find anything.

We disband the call and create a group chat to have further updates on this topic.

2 hrs later:

$Mon: Yeah we had restarted a service on the monitoring server because it had crashed. That has caused this whole fiasco.

$Me: You know what to do!!

Moral of the story: If you do not present evidence or sound reasoning to back up your suspicions, people will push back on any query you put forward.